Algorithm Research For Text Information Extraction Based On Hidden Markov Model

Posted on: 2005-10-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y Z Liu  Full Text: PDF
GTID: 2168360125958545  Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet is increasing exponentially, and how to handle this large volume of online documents has become an important research focus. Text information extraction is a natural language processing task that automatically extracts specific types of information, such as events and facts, from text, forms structured data, and then populates database slots for queries. It is an important method for processing large quantities of text. This thesis mainly studies algorithms for text information extraction based on the hidden Markov model.

The thesis first analyzes how to learn the model topology and the concrete parameters for text information extraction. Then, making use of layout information and segment lists, it proposes a hidden Markov model algorithm for information extraction based on text blocks. The experimental results show that the new algorithm achieves higher precision and recall.

To address the problem that the training data is sometimes too varied in format for statistical learning techniques to find optimal model parameters, the thesis proposes a new hidden Markov model algorithm for text information extraction based on multiple templates. It trains multiple pairs of initial probability and transition probability parameters, one pair for each model format template, and combines them with a single unified set of emission probability parameters for information extraction. The experimental results show that the new algorithm can improve precision and recall in some cases.

Combining context features with the features of the tokens themselves, the thesis also proposes a new hidden Markov model algorithm for text information extraction based on maximum entropy. The experimental results show that the new algorithm can improve precision and recall, although it increases the time complexity.
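The multiple-template idea described above can be illustrated with a small sketch. It is not the thesis's implementation: the state names, the toy probabilities, the function names `viterbi` and `decode_with_templates`, and in particular the rule of keeping whichever template yields the highest Viterbi score are all assumptions made for illustration. The abstract only states that each template has its own initial and transition probabilities while the emission probabilities are shared.

```python
import math

def viterbi(obs, states, init, trans, emit):
    """Return (best_log_prob, best_state_path) for one observation sequence.

    All probabilities are assumed to be strictly positive so that
    math.log never receives zero.
    """
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per step after the first
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            score[s] = (prev[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s][o]))
            pointers[s] = best_prev
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def decode_with_templates(obs, states, templates, emit):
    """Decode with every (init, trans) template and keep the best-scoring path.

    Choosing the template by highest Viterbi score is an assumption of this
    sketch, not something the abstract specifies.
    """
    results = [viterbi(obs, states, init, trans, emit) for init, trans in templates]
    return max(results, key=lambda r: r[0])

# Hypothetical toy model: two fields, two token classes.
states = ("title", "author")
emit = {  # shared by all templates
    "title": {"word": 0.8, "name": 0.2},
    "author": {"word": 0.1, "name": 0.9},
}
title_first = (
    {"title": 0.9, "author": 0.1},
    {"title": {"title": 0.7, "author": 0.3}, "author": {"title": 0.1, "author": 0.9}},
)
author_first = (
    {"title": 0.1, "author": 0.9},
    {"title": {"title": 0.9, "author": 0.1}, "author": {"title": 0.3, "author": 0.7}},
)

score, path = decode_with_templates(
    ["name", "name", "word", "word"], states, [title_first, author_first], emit
)
print(path)  # ['author', 'author', 'title', 'title']
```

In this toy run the author-first template explains the sequence better, so its path is kept. Working in log space avoids numerical underflow on longer token sequences.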
Considering the high cost of manually labeling training data, this thesis uses an active hidden Markov model for information extraction. By setting different thresholds for the relevant parameters and comparing the user-labeling ratio with the information extraction accuracy, the optimal threshold for the active model parameters can be selected, which reduces the user's labeling workload without degrading the performance of text information extraction.
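The threshold trade-off described above can be sketched as follows. The abstract does not say which quantity the threshold applies to, so this sketch assumes a margin criterion: a token is sent to the user for labeling when the gap between its two most probable states falls below the threshold. The per-token posteriors are hypothetical numbers standing in for the output of a trained model, and the function names are illustrative only.

```python
def margin(posterior):
    """Gap between the two largest state probabilities for one token."""
    top, second = sorted(posterior.values(), reverse=True)[:2]
    return top - second

def tokens_to_label(posteriors, threshold):
    """Indices of tokens the model is unsure about (margin below threshold)."""
    return [i for i, p in enumerate(posteriors) if margin(p) < threshold]

def labeling_ratio(posteriors, threshold):
    """Fraction of tokens that would be sent to the user for labeling."""
    return len(tokens_to_label(posteriors, threshold)) / len(posteriors)

# Hypothetical per-token state posteriors (assumed, not taken from the thesis).
posteriors = [
    {"title": 0.95, "author": 0.05},  # confident
    {"title": 0.55, "author": 0.45},  # uncertain
    {"title": 0.30, "author": 0.70},  # moderately confident
    {"title": 0.48, "author": 0.52},  # very uncertain
]

# A higher threshold asks the user to label a larger share of the tokens.
for threshold in (0.05, 0.2, 0.5):
    print(threshold, tokens_to_label(posteriors, threshold),
          labeling_ratio(posteriors, threshold))
```

Sweeping the threshold like this gives the user-labeling ratio for each setting. Pairing each ratio with the extraction accuracy measured on held-out data would then allow choosing the smallest labeling effort that keeps accuracy acceptable.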
Keywords/Search Tags: Text Information Extraction, Hidden Markov Model, Text Block, Multiple Templates, Maximal Entropy, Active Learning