
Web Information Extraction Research Based On Conditional Random Fields

Posted on: 2011-08-29  Degree: Master  Type: Thesis
Country: China  Candidate: D H Zhu  Full Text: PDF
GTID: 2178360308477373  Subject: Computer application technology
Abstract/Summary:
With the rapid growth of Internet resources, users' information needs can no longer be met by web browsers and keyword-based retrieval through search engines alone. Web information extraction emerged in response to this situation. This thesis focuses on conditional random fields (CRFs): it studies two novel CRF variants and proposes a Web information extraction system based on a model suited to such problems. The main contributions are as follows.

First, most existing research concentrates on linear-chain CRFs, while CRFs with more complex structure are seldom studied. Building on previous work, this thesis systematically investigates the formalization and algorithms of two novel families of CRFs, laying a solid theoretical foundation for the rest of the work.

Second, compared with HMMs, CRFs can incorporate long-distance and overlapping features. Although CRFs have been widely applied domestically, most applications are limited to linear-chain CRFs under the Markov assumption. Such models cannot represent long-range dependencies between nodes, so few studies have examined how long-distance features affect extraction performance. This thesis investigates this problem specifically, proposes a long-distance-dependency CRF, and evaluates it in Web text information extraction experiments. The results show that long-distance features contribute substantially to the model's performance.

Third, in recent years more and more research has applied statistical models to Web information extraction, but such models typically suffer from two limitations. The first is poor extensibility: feature functions are usually hard-coded in the source, so users with different extraction needs cannot customize their own feature functions.
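The distinction between Markov-assumption features and long-distance features can be illustrated with a minimal sketch (not the thesis's implementation; the feature functions, token sequence, and weights below are invented for illustration). A linear-chain CRF feature sees only adjacent positions, while a long-distance feature links the current label to one several positions back:

```python
def linear_chain_feature(y, x, t):
    """Fires when a PERSON label directly follows a title word -- a
    standard Markov-style feature over adjacent positions (t-1, t)."""
    return 1.0 if t > 0 and x[t - 1] == "Mr." and y[t] == "PERSON" else 0.0

def long_distance_feature(y, x, t, k=3):
    """Fires when the label k positions back agrees with the current one,
    a dependency a strict linear-chain CRF cannot express."""
    return 1.0 if t >= k and y[t - k] == y[t] == "PERSON" else 0.0

def score(y, x, features, weights):
    """Unnormalized CRF score: weighted sum of feature firings over all positions."""
    return sum(w * f(y, x, t)
               for f, w in zip(features, weights)
               for t in range(len(x)))

x = ["Mr.", "Zhu", "met", "Mr.", "Wang"]
y = ["O", "PERSON", "O", "O", "PERSON"]
s = score(y, x, [linear_chain_feature, long_distance_feature], [1.0, 0.5])
# s == 2.5: the linear-chain feature fires at t=1 and t=4,
# the long-distance feature fires once at t=4 (y[1] == y[4] == "PERSON")
```

In a full CRF these scores would be exponentiated and normalized over all label sequences; the sketch only shows how a long-distance feature adds signal that local features miss.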
The second limitation is that these models need a large number of training Web pages to reach good performance, and producing training pages costs considerable time and effort. This thesis exploits the extensibility and convenient interchange of XML and proposes XML conditional random fields (XCRF). XCRF places labels and feature functions in an XML file that is independent of the source code, with features expressed as XPath expressions, so ordinary users can easily customize their own features without knowing much about the source code. Moreover, triangle features are designed specifically for XML-structured documents; because such features capture the hierarchy of a Web page well, they drastically reduce the number of training pages required. Finally, Web information extraction experiments based on XCRF show that XCRF is well suited to Web information extraction.
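The configuration-driven idea behind XCRF can be sketched as follows. The XML schema, element names, and labels here are invented for illustration, not the thesis's actual format, and Python's standard `xml.etree.ElementTree` supports only a small XPath subset, whereas XCRF uses full XPath expressions. The point is that feature definitions live in an XML file a user can edit without touching code:

```python
import xml.etree.ElementTree as ET

# Hypothetical feature config: each feature pairs a label with a path
# expression over the page's DOM.
FEATURE_CONFIG = """
<features>
  <feature label="TITLE" xpath=".//h1"/>
  <feature label="PRICE" xpath=".//span[@class='price']"/>
</features>
"""

PAGE = """
<html><body>
  <h1>Canon EOS 550D</h1>
  <div><span class="price">$699</span></div>
</body></html>
"""

def extract(page_xml, config_xml):
    """Apply each configured path expression to the page and collect
    (label, text) pairs -- customization happens in the XML config,
    never in the source code."""
    page = ET.fromstring(page_xml)
    results = []
    for feat in ET.fromstring(config_xml).iter("feature"):
        for node in page.findall(feat.get("xpath")):
            results.append((feat.get("label"), node.text))
    return results

pairs = extract(PAGE, FEATURE_CONFIG)
# pairs == [('TITLE', 'Canon EOS 550D'), ('PRICE', '$699')]
```

In XCRF the matched nodes would feed feature functions of a CRF rather than being emitted directly, but the separation of features from code is the same.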
Keywords/Search Tags: Web Information Extraction, Conditional Random Fields, Long Distance Dependencies, XML, XCRF