Researches On Models And Algorithms Of Text Information Extraction

Posted on: 2008-01-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S X Zhou
Full Text: PDF
GTID: 1118360242965181
Subject: Control theory and control engineering
Abstract/Summary:
Since the 1960s, text information extraction has received extensive attention from researchers in China and abroad as an important branch of Natural Language Processing, and it has developed steadily and produced many results. However, several key problems remain open: extraction performance is poor, extraction models apply to only a narrow range of texts, and manually labeling training texts requires an enormous amount of work. In this dissertation, models and algorithms for text information extraction are studied using both rule-based and statistical methods, with the aims of improving extraction performance, broadening the applicability of the models, reducing the dependence on manually labeled text in model training, and strengthening active learning. The main contributions of this dissertation are as follows:

(1) A novel wrapper induction algorithm is proposed, based on an analysis of two existing types of algorithms: those based on page landmarks and those based on text patterns. The new algorithm combines the advantages of both. It locates information using the landmark information of Web pages, and it uses text patterns both to extract information and to filter the extraction results. As a result, it achieves higher extraction accuracy and greater expressive power.

(2) A novel wrapper maintenance algorithm based on page features is proposed, so that a wrapper can adapt automatically to changed Web pages instead of becoming invalid when the pages change. The algorithm is based on the observation that, despite various changes to pages, many important page features are preserved, such as text patterns, annotations, and hyperlinks. First, the algorithm learns these page features from training examples collected under normal conditions. Then, it uses the preserved features to identify the locations of the desired values in the changed Web pages, and thereby repairs invalid wrappers automatically. Experiments on several real Web sites show that the algorithm maintains wrappers effectively.

(3) A text information extraction algorithm based on clustering HMMs (hidden Markov models) is proposed. Texts from different network sources differ in form. In earlier methods, a single model was trained on all texts mixed together, so the optimal model was usually difficult to obtain and extraction performance suffered. This dissertation applies clustering to text information extraction. First, the K-means clustering approach is improved to enhance clustering performance. Then, the Markov chains of the training texts are clustered with the improved K-means approach, and a separate HMM is trained on each cluster. Finally, every model is used for extraction, and the best result is selected by comparing all the extraction results. Simulation results show that the new model and algorithm apply well to texts from different sources and achieve higher extraction performance.

(4) Information entropy models for HMM-based text information extraction are studied. First, because features can improve extraction performance, an HMM-based extraction algorithm using maximum entropy is proposed. Contextual features and features contained in the text vocabulary are added to model training and extraction, and maximum entropy is used to improve extraction performance. Then, a mutual information model is applied to HMM-based text information extraction in order to extract key information from very long texts. In this method, the transition probability between the disjunctive states in the HMM is defined quantitatively through pointwise mutual information. This makes it possible to extract the key information from a text with fairly good results.

(5) The second-order HMM for text information extraction is studied. The first-order HMM assumes that the state transition probability and the observation output probability depend only on the current state of the model, which lowers extraction precision to some extent. The second-order HMM also takes into account the relationship between these probabilities and the model's earlier states, which makes it better at recognizing incorrect information. The ML (Maximum Likelihood) algorithm for the second-order model is derived from the ML algorithm for the first-order model, and a text information extraction algorithm based on the second-order HMM is then proposed. The validity of the second-order HMM for information extraction is analyzed, and simulation experiments on text information extraction are carried out. The results show that the new algorithm achieves higher precision than the algorithm based on the first-order HMM.

(6) In addition, an extraction approach combining the maximum entropy model with the second-order HMM is studied. Extraction based on the second-order HMM improves the recognition of incorrect information and the correctness of extraction, but it does not improve recall. In this approach, therefore, contextual features are added to second-order HMM extraction through the maximum entropy model. As a result, the performance of the second-order HMM and the correctness of extraction are further improved, and recall is improved as well.

(7) Moreover, an active learning algorithm for text information extraction is proposed. When only some of the training texts are labeled, the algorithm uses active learning to select the most valuable training texts to label. It can be used for text information extraction based on both the wrapper model and the HMM. It reduces the dependence on labeled training texts in model training, and so lessens the manual labeling workload without degrading extraction performance.

To sum up, this dissertation studies models and algorithms for text information extraction using rule-based and statistical methods in the aspects described above. Several key problems in text information extraction are addressed: the accuracy and recall of extraction are improved, the applicability of the models to texts of dissimilar form and to changed Web pages is improved, and the ability of active learning is strengthened, so that the dependence on labeled training texts in model training and the manual labeling workload are reduced.
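
As an illustration of item (5), the following is a minimal sketch of maximum-likelihood estimation of second-order state transition probabilities from labeled state sequences, where P(s_t | s_{t-2}, s_{t-1}) is estimated as count(s_{t-2}, s_{t-1}, s_t) / count(s_{t-2}, s_{t-1}). It is not the dissertation's implementation: the function name `second_order_transitions`, the state labels, the toy training data, and the absence of smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def second_order_transitions(sequences):
    """Estimate P(s_t | s_{t-2}, s_{t-1}) by relative frequency.

    Hypothetical helper: plain maximum likelihood, no smoothing,
    so unseen state triples simply get no entry.
    """
    trigram_counts = Counter()   # counts of (s_{t-2}, s_{t-1}, s_t)
    context_counts = Counter()   # counts of (s_{t-2}, s_{t-1}) as a context
    for seq in sequences:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            trigram_counts[(a, b, c)] += 1
            context_counts[(a, b)] += 1

    probs = defaultdict(dict)
    for (a, b, c), n in trigram_counts.items():
        probs[(a, b)][c] = n / context_counts[(a, b)]
    return dict(probs)


# Hypothetical labeled state sequences (for example, fields of a paper header).
training = [
    ["title", "title", "author", "author", "affiliation"],
    ["title", "title", "author", "affiliation", "affiliation"],
]
transitions = second_order_transitions(training)
```

In this toy data the context ("title", "author") is followed once by "author" and once by "affiliation", so each successor is estimated at 0.5. A first-order model would condition on "author" alone and could not distinguish this context from ("author", "author").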
Keywords/Search Tags: text information extraction, the wrapper model, the HMM model, text clustering, information entropy model, active learning