Research And Improvement Of Web Information Extraction Method Based On HMM Model

Posted on:2009-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:B B Liu

Full Text:PDF

GTID:2178360272474002

Subject:Computer software and theory

Abstract/Summary:

With the development of Internet technology, the amount of information on the Internet is growing exponentially, and how to handle this huge volume of online documents has become an important research focus. General search engines and vertical search engines are therefore key research topics. Unlike a general search engine, a vertical search engine provides more focused and structured information. Information extraction is a natural language processing task that automatically extracts specific types of information, such as events and facts, from text, turns them into structured data, and populates database slots for queries. Information extraction technology has developed considerably, but extraction based on HTML templates is still the usual method in vertical search engines. This method achieves high precision and recall, but it reduces the flexibility of the extraction system and increases the cost of maintenance.

This thesis studies algorithms for Web information extraction based on the hidden Markov model (HMM). HMM-based Web information extraction is a machine learning method, which can increase the flexibility of the extraction system and reduce the cost of maintenance.

The thesis reviews the background and history of information extraction and analyzes typical technologies and systems for Web information extraction that use machine learning to learn the characteristics of text. It then explains the principles and main algorithms of the HMM and the second-order HMM: the forward and backward algorithms for evaluation; maximum-likelihood estimation and the Baum-Welch algorithm for learning the model from training samples; and the Viterbi algorithm for decoding. It discusses how to apply the HMM to text information extraction and how to label the data, and proposes several methods for improving the HMM in information extraction.

Finally, a Web information extraction model based on the HMM is established. Comparison and analysis of the model's output show that the improvements to the HMM are effective and that the model meets the requirements for use in a vertical search engine.
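
The training and decoding steps named in the abstract can be illustrated with a minimal first-order sketch. It is not the thesis's model: the states (`label`, `title`, `price`), the toy page tokens, the add-one smoothing, and the function names `train_hmm` and `viterbi` are all assumptions made for illustration. Parameters are estimated by smoothed maximum likelihood from labeled token sequences, and the Viterbi algorithm then recovers the most probable state sequence for a token sequence.

```python
import math
from collections import Counter, defaultdict

UNK = "<UNK>"  # placeholder emission for tokens never seen in training

def train_hmm(tagged_sequences, smoothing=1.0):
    """Smoothed maximum-likelihood estimate from (token, state) sequences."""
    states = sorted({s for seq in tagged_sequences for _, s in seq})
    vocab = sorted({w for seq in tagged_sequences for w, _ in seq} | {UNK})
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        prev = None
        for word, state in seq:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][word] += 1
            prev = state

    def log_dist(counts, support):
        # Additive smoothing keeps every probability non-zero.
        total = sum(counts[x] for x in support) + smoothing * len(support)
        return {x: math.log((counts[x] + smoothing) / total) for x in support}

    log_start = log_dist(start, states)
    log_trans = {s: log_dist(trans[s], states) for s in states}
    log_emit = {s: log_dist(emit[s], vocab) for s in states}
    return states, log_start, log_trans, log_emit

def viterbi(tokens, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for tokens (log-space)."""
    if not tokens:
        return []

    def emission(state, word):
        table = log_emit[state]
        return table.get(word, table[UNK])

    score = {s: log_start[s] + emission(s, tokens[0]) for s in states}
    backpointers = []
    for word in tokens[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = (score[best_prev] + log_trans[best_prev][s]
                            + emission(s, word))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical labeled tokens from two product pages.
TRAINING = [
    [("Title:", "label"), ("Nikon", "title"), ("D90", "title"),
     ("Price:", "label"), ("$899", "price")],
    [("Title:", "label"), ("Canon", "title"), ("EOS", "title"),
     ("Price:", "label"), ("$650", "price")],
]

states, log_start, log_trans, log_emit = train_hmm(TRAINING)
print(viterbi(["Title:", "Nikon", "D90", "Price:", "$899"],
              states, log_start, log_trans, log_emit))
```

On this toy data, a training sequence decodes back to its own labels. Working in log-space avoids numeric underflow on long token sequences, and the `<UNK>` entry lets the decoder handle tokens that did not appear in training.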

Keywords/Search Tags:

HMM, Information extraction, Machine learning

Related items

1	Information Extraction Of Chinese Biodiversity Document Based On Machine Learning
2	Research On Deep Learning Technology For Information Extraction Applications
3	Machine learning for information extraction in informal domains
4	Literature Information Extraction System From Academic Homepage
5	Information Extraction And Information Visualization Based On Conditional Random Fields
6	An Information Extraction System For DynamicView
7	Research And Improvement Of Web Information Extraction Method Based On HMM Model
8	Research And Realization On The Key Technologies Of Chinese Information Extraction
9	Machine learning for information extraction
10	Research And Application Of Machine Learning In Gesture Recognition