Algorithm Research For Text Information Extraction Based On Hidden Markov Model

Posted on:2005-10-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Liu

Full Text:PDF

GTID:2168360125958545

Subject:Computer application technology

Abstract/Summary:

With the development of Internet technology, the amount of information available online is growing exponentially, and handling this large volume of online documents has become an important research problem. Text information extraction is a natural language processing task that automatically extracts specific types of information, such as events and facts, from text, converts it into structured data, and populates database slots for later queries. It is an important method for processing large quantities of text. This thesis studies algorithms for text information extraction based on the hidden Markov model.

The thesis first analyzes how to learn the model topology and the concrete model parameters for text information extraction. Then, making use of layout information and segment lists, it proposes a hidden Markov model algorithm that extracts information on the basis of text blocks. The experimental results show that the new algorithm achieves higher precision and recall.

Training data is sometimes too heterogeneous for statistical learning techniques to find optimal model parameters. To address this problem, the thesis proposes a hidden Markov model algorithm based on multiple templates: it trains multiple pairs of initial-probability and transition-probability parameters, one pair for each model format template, and combines them with a single shared set of emission-probability parameters for information extraction. The experimental results show that the new algorithm can improve precision and recall in some cases.

By combining context features with the features of the tokens themselves, the thesis also proposes a hidden Markov model algorithm based on maximum entropy. The experimental results show that the new algorithm improves precision and recall, although it increases the time complexity.
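
As a rough illustration of how a hidden Markov model can label tokens for extraction, the following Python sketch decodes the most likely field sequence with the standard Viterbi algorithm. The field names (`title`, `author`), the toy probabilities, and the `unk` floor for unseen tokens are assumptions made for this example. They are not taken from the thesis, and the sketch does not implement its text-block, multiple-template, or maximum-entropy variants.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for `tokens` under an HMM.

    `unk` is an assumed floor probability for tokens a state has never emitted.
    """
    if not tokens:
        return []

    def lg(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(tokens[0], unk)) for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(tokens[t], unk))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model (hypothetical values): each state is a database slot to fill.
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.1, "author": 0.9},
}
emit_p = {
    "title": {"hidden": 0.3, "markov": 0.3, "model": 0.3},
    "author": {"liu": 0.8, "smith": 0.2},
}

labels = viterbi(["hidden", "markov", "model", "liu"], states, start_p, trans_p, emit_p)
print(labels)  # ['title', 'title', 'title', 'author']
```

Tokens that share a label can then be grouped to fill the corresponding slot, here a title of "hidden markov model" and an author of "liu".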
Because manually labeling the training data for the experiments is costly, the thesis also applies an active hidden Markov model to information extraction. By setting different thresholds for the relevant parameters and comparing the resulting user-labeling ratio and extraction accuracy, an optimal threshold for the active model can be selected, which reduces the user's labeling workload without degrading the performance of text information extraction.
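
The abstract does not say which parameters are thresholded. As a minimal sketch, assuming the active model asks the user to label a token whenever the margin between its two most probable states falls below a threshold, the user-labeling ratio for a given threshold could be computed as follows. The function names and the posterior values are hypothetical.

```python
def select_for_labeling(posteriors, threshold):
    """Return indices of tokens whose top-two posterior margin is below `threshold`.

    `posteriors` holds one dict per token, mapping state -> posterior probability.
    The margin criterion is an assumption, not the thesis's stated rule.
    """
    queries = []
    for i, dist in enumerate(posteriors):
        ranked = sorted(dist.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        if ranked[0] - runner_up < threshold:
            queries.append(i)
    return queries


def labeling_ratio(posteriors, threshold):
    """Fraction of tokens that would be sent to the user for labeling."""
    if not posteriors:
        return 0.0
    return len(select_for_labeling(posteriors, threshold)) / len(posteriors)


# Hypothetical per-token state posteriors from a trained model.
posteriors = [
    {"title": 0.95, "author": 0.05},  # confident
    {"title": 0.55, "author": 0.45},  # uncertain
    {"title": 0.20, "author": 0.80},  # fairly confident
]

for threshold in (0.05, 0.5, 0.7):
    print(threshold, select_for_labeling(posteriors, threshold))
```

Sweeping the threshold in this way gives the labeling ratio at each setting. Pairing each ratio with the extraction accuracy measured after retraining would then show the trade-off that the threshold selection is meant to balance.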

Keywords/Search Tags:

Text Information Extraction, Hidden Markov Model, Text Block, Multiple Templates, Maximum Entropy, Active Learning

Related items

1	Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model
2	Researches On Models And Algorithms Of Text Information Extraction
3	Web Text Information Extraction And Classification
4	Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model
5	Research On Error Correction Technology Of Text Recognition Based On Hidden Markov Model
6	Research On Algorithms For Machine Learning And Text Mining
7	Sentence Extraction And Reduction For Indonesian Text Summarization
8	The Algorithm Research Of Chinese Information Extraction Based On The Hidden Markov Model
9	Research Of Web Text Mining Technology Based On Hidden Markov Model
10	Research On Spatial And Temporal Information Extraction In Unstructured Text