Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model

Posted on:2017-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:Z Shuang

GTID:2308330485969065

Subject:Computer software and theory

Abstract/Summary:

The rapid development of Internet technology has led to exponential growth in online data and marks the arrival of the age of big data. Much of the data people create is semi-structured or unstructured. The purpose of information extraction (IE) technology is to obtain target information accurately and quickly from large amounts of data and thereby improve the utilization of that information. An automated tool is therefore needed to help people quickly find what they actually need in massive data sets, and to classify, extract, and restructure that information automatically so that it is suitable for subsequent review and automatic processing. This requires mature information extraction technology. However, the field still has many problems, such as insufficient IE performance, insufficient automation (collecting and labeling training data requires a lot of manual work), limited scope of application, and a lack of portability.
In this thesis, we analyze the problems and deficiencies in building Hidden Markov Models (HMMs) for information extraction, and propose an improved HMM that draws on the strength of the Maximum Entropy (ME) model in representing feature knowledge. We introduce forward and backward dependency assumptions into the HMM, and adjust the model parameters using the features of the emission unit and its context information. The state transition probability and the output probability of the improved HMM depend not only on the current state of the model, but are also corrected by the forward and backward state values of the model's historical states.
We then propose a web information extraction method that uses the improved Hidden Markov Model and incorporates the characteristics of web data. We exploit the fact that similar or related content tends to be grouped together in a web page, and use the web content block as the basic extraction unit (emission unit). We detect the state transition order from the page layout structure obtained with the VIPS (Vision-based Page Segmentation) algorithm. We use multiple emission features (semantic terms, layout, format) instead of a single emission feature (semantic terms), and derive the formula for calculating the observation emission probability.
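The abstract does not give the thesis's emission formula, so the following Python sketch is only an illustration of the general idea, not the author's method. It treats each web content block as one emission unit described by three features (semantic term, layout, format), assumes those features are conditionally independent given the state so that the emission probability is their product, and decodes the state sequence with the standard Viterbi algorithm. The state names, feature values, and every probability are hypothetical, and the ME-based forward/backward correction described above is not modeled.

```python
import math

# Hypothetical label set for content blocks on a page.
STATES = ["title", "author", "body"]

# Hypothetical initial and transition probabilities (in the thesis these would
# be estimated from training pages; the values here are made up).
START = {"title": 0.8, "author": 0.1, "body": 0.1}
TRANS = {
    "title": {"title": 0.10, "author": 0.70, "body": 0.20},
    "author": {"title": 0.05, "author": 0.15, "body": 0.80},
    "body": {"title": 0.05, "author": 0.05, "body": 0.90},
}

# Hypothetical per-feature emission tables: EMIT[feature][state][value].
EMIT = {
    "term": {
        "title": {"headline": 0.7, "name": 0.1, "text": 0.2},
        "author": {"headline": 0.1, "name": 0.8, "text": 0.1},
        "body": {"headline": 0.1, "name": 0.1, "text": 0.8},
    },
    "layout": {
        "title": {"top": 0.9, "middle": 0.1},
        "author": {"top": 0.6, "middle": 0.4},
        "body": {"top": 0.1, "middle": 0.9},
    },
    "format": {
        "title": {"bold": 0.8, "plain": 0.2},
        "author": {"bold": 0.3, "plain": 0.7},
        "body": {"bold": 0.1, "plain": 0.9},
    },
}


def emission_logp(state, block):
    """Log P(block | state), assuming the features are independent given the state."""
    return sum(math.log(EMIT[feat][state][block[feat]]) for feat in EMIT)


def viterbi(blocks):
    """Return the most probable state sequence for a list of content blocks."""
    if not blocks:
        return []
    # Scores for the first block: start probability times emission probability.
    score = {s: math.log(START[s]) + emission_logp(s, blocks[0]) for s in STATES}
    backpointers = []
    for block in blocks[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + emission_logp(s, block)
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    page = [
        {"term": "headline", "layout": "top", "format": "bold"},
        {"term": "name", "layout": "top", "format": "plain"},
        {"term": "text", "layout": "middle", "format": "plain"},
    ]
    print(viterbi(page))
```

Working in log space avoids numerical underflow on pages with many blocks. Replacing the product of per-feature probabilities with a weighted, ME-style combination would be one way to move this sketch closer to the model the abstract describes, but the exact form is not stated in the source.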

Keywords/Search Tags:

Hidden Markov Model, Maximal Entropy, Web Information Extraction, Web Content Block

Related items

1	Algorithm Research For Text Information Extraction Based On Hidden Markov Model
2	The Algorithm Research Of Chinese Information Extraction Based On The Hidden Markov Model
3	Application Research Of Hidden Markov Model In Information Extraction
4	The Contourlet-based Statistical Models For SAR Images Denoising
5	Research On Heterogeneous Academic Information Extraction And Aggregation Based On Web
6	Web Text Information Extraction And Classification
7	Research On WLAN Indoor Location Algorithm Based On Information Entropy
8	Research On Domain Entity Attribute And Event Extraction Technology
9	Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model
10	Personalized Recommendation Algorithm Research Based On Incomplete Information Feature Extraction And Hidden Markov Model