Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model

Posted on: 2017-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z Shuang
Full Text: PDF
GTID: 2308330485969065
Subject: Computer software and theory
Abstract/Summary:
The rapid development of Internet technology has led to exponential growth in online data and marks the arrival of the age of big data. At the same time, people have created a large amount of semi-structured and unstructured data. The purpose of information extraction (IE) technology is to obtain target information accurately and quickly from large volumes of data, and thereby to improve the utilization of information. An automated tool is therefore needed to help people quickly find what they actually need in massive data, and to classify, extract, and restructure that information automatically so that it is suitable for subsequent review and automatic processing. This requires mature information extraction technology. However, the field still faces many problems, such as insufficient IE performance, insufficient automation (collecting and labeling training data requires a lot of manual work), limited scope of application, and poor portability.
In this paper, we analyze the problems and deficiencies in building Hidden Markov Models (HMMs) for information extraction and propose an improved HMM that draws on the strength of the Maximum Entropy (ME) model in representing feature knowledge. We introduce forward and backward dependency assumptions into the HMM and adjust the model parameters using the features of the emission unit and its context information. The state transition probability and the output probability of the improved HMM depend not only on the current state of the model but are also corrected by the forward and backward state values of the model's historical states.
Building on this model, we propose a web information extraction method that uses the improved HMM and incorporates the characteristics of web data.
In this paper, we exploit the tendency of similar or related content to cluster together on a web page and use the web content block as the basic extraction unit (emission unit). We determine the state transition order from the page layout structure obtained with the VIPS algorithm. Instead of a single emission feature (semantic terms), we use multiple emission features (semantic terms, layout, and format) and derive the formula for calculating the observation emission probability.
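To make the idea of multi-feature emissions concrete, the following is a minimal sketch and not the formula derived in the thesis, which the abstract does not state. It assumes that the features of a content block (semantic term, layout, format) are conditionally independent given the state, so the emission probability is the product of per-feature probabilities, and it decodes block labels with standard Viterbi rather than the improved model with forward and backward dependencies. All names (`emission_log_prob`, `viterbi`, `feature_tables`) and all probability values are hypothetical.

```python
import math


def emission_log_prob(state, block, feature_tables, smoothing=1e-6):
    """Log emission probability of one content block under `state`.

    Assumption: features are independent given the state, so the
    per-feature log-probabilities are simply summed.
    """
    total = 0.0
    for feature, value in block.items():
        prob = feature_tables[feature].get(state, {}).get(value, smoothing)
        total += math.log(prob)
    return total


def viterbi(blocks, states, start_p, trans_p, feature_tables):
    """Return the most likely state sequence for a list of content blocks."""
    # scores[t][s]: best log score of any path ending in state s at block t
    scores = [{
        s: math.log(start_p[s]) + emission_log_prob(s, blocks[0], feature_tables)
        for s in states
    }]
    back = []  # back[t][s]: best predecessor of state s at block t + 1
    for block in blocks[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                          + emission_log_prob(s, block, feature_tables))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model with made-up probabilities (illustration only).
states = ["title", "body"]
start_p = {"title": 0.8, "body": 0.2}
trans_p = {"title": {"title": 0.1, "body": 0.9},
           "body": {"title": 0.1, "body": 0.9}}
feature_tables = {
    "term": {"title": {"headline": 0.7, "text": 0.3},
             "body": {"headline": 0.1, "text": 0.9}},
    "format": {"title": {"bold": 0.8, "plain": 0.2},
               "body": {"bold": 0.1, "plain": 0.9}},
    "layout": {"title": {"top": 0.9, "middle": 0.1},
               "body": {"top": 0.2, "middle": 0.8}},
}
blocks = [
    {"term": "headline", "format": "bold", "layout": "top"},
    {"term": "text", "format": "plain", "layout": "middle"},
    {"term": "text", "format": "plain", "layout": "middle"},
]
path = viterbi(blocks, states, start_p, trans_p, feature_tables)
print(path)
```

In this sketch each observation is a whole content block rather than a single token, which mirrors the choice of the content block as the emission unit. In a real system the blocks and their order would come from a page segmentation step such as VIPS, which is not implemented here.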
Keywords/Search Tags: Hidden Markov Model, Maximum Entropy, Web Information Extraction, Web Content Block