
Research And Improvement Of Web Information Extraction Method Based On HMM Model

Posted on: 2009-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: B B Liu
GTID: 2178360272474002
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet is growing exponentially, and how to handle this large volume of online documents has become an important research focus. General search engines and vertical search engines are therefore key research topics. Unlike a general search engine, a vertical search engine provides more focused and structured information. Information extraction is a natural language processing task that automatically extracts specific types of information, such as events and facts, from text, forms structured data, and populates database slots for later queries. Information extraction technology has developed considerably, but extraction based on HTML templates remains the usual method in vertical search engines. This method achieves high precision and recall, but it reduces the flexibility of the extraction system and increases the cost of maintenance.

This thesis mainly studies algorithms for Web information extraction based on the hidden Markov model (HMM). HMM-based Web information extraction is a machine learning method, which can increase the flexibility of the extraction system and reduce the cost of maintenance.

This thesis reviews the background and history of information extraction, and analyzes typical technologies and systems for Web information extraction that use machine learning to learn the characteristics of text. The principles and main algorithms of the HMM and the second-order HMM are also explained: the forward and backward algorithms for evaluation; the maximum-likelihood algorithm and the Baum-Welch algorithm for training the model on training samples; and the Viterbi algorithm for decoding. How to apply the HMM and how to label data in text information extraction are discussed, and several methods for improving the hidden Markov model for information extraction are proposed.
Then, a Web information extraction model based on the HMM is established. Comparison and analysis of the model's output show that the improvements to the HMM are effective and meet the requirements for application in a vertical search engine.
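As an illustration of the decoding step mentioned in the abstract, the following is a minimal sketch of first-order Viterbi decoding in Python. It is not the thesis's model: the two states ("label", "value"), the token classes ("word", "colon", "number"), and all probabilities are hypothetical values chosen only to show how an HMM assigns a field tag to each token of a page fragment such as "Price : 25 30". The function assumes a non-empty observation sequence.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs, computed in log space."""

    def log(p):
        # Zero probabilities map to -inf so they can never win a max().
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: every number below is an illustrative assumption.
states = ["label", "value"]
start_p = {"label": 0.8, "value": 0.2}
trans_p = {
    "label": {"label": 0.6, "value": 0.4},
    "value": {"label": 0.3, "value": 0.7},
}
emit_p = {
    "label": {"word": 0.6, "colon": 0.35, "number": 0.05},
    "value": {"word": 0.3, "colon": 0.05, "number": 0.65},
}

# Token classes for a fragment like "Price : 25 30".
obs = ["word", "colon", "number", "number"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
print(list(zip(obs, path)))
```

With these toy parameters the decoder tags the first two tokens as "label" and the two numbers as "value". In practice the start, transition, and emission probabilities would be estimated from training pages rather than written by hand.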
Keywords/Search Tags: HMM, Information extraction, Machine learning