Study On Key Technology Of GHMM-Based Web Text Information Extraction And System Design

Posted on: 2009-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2178360272978157
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become the world's largest source of information. A common problem is therefore how to obtain this Web information, and Web Information Extraction (WebIE) has been proposed to address it. At present, most information extraction methods deal with plain text and do not take the Web page into account. In addition, information extraction rarely involves the understanding of semantics.
The Hidden Markov Model (HMM) is now commonly used as an information extraction model. It is easy to build, adapts well, and achieves high precision, so it has attracted growing attention from researchers. However, the model is suited only to plain text, not to Web pages, which contain more information. An analysis of Web pages shows that Web information carries additional emission features, such as format and layout. The traditional HMM is limited in that it considers only the semantic term as the observed emission feature. We therefore use multiple emission features (term, layout, and formatting) instead of a single emission feature (term) for state transition estimation in the HMM, which leads to the Generalized Hidden Markov Model (GHMM).
For plain text, the traditional HMM information extraction model takes a single term as the basic unit of extraction. Its assumed sequential state transition order, left to right and then top to bottom, does not suit the two-dimensional space of a Web page. Based on an analysis of Web pages, we find that the visual layout of a page is composed of different blocks and that certain logical relations exist between them. A vision-based page segmentation algorithm is proposed to partition Web pages into blocks. This yields a better state transition sequence for the GHMM and fits the layout structure of Web pages more closely.
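The idea of scoring an observation by several emission features at once can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the state names, feature values, probabilities, and the assumption that features are conditionally independent given the state are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-feature emission tables: P(feature value | state).
# States, feature values, and probabilities are illustrative only.
EMISSIONS = {
    "term": {
        "job_title": {"engineer": 0.6, "beijing": 0.1},
        "location": {"engineer": 0.05, "beijing": 0.7},
    },
    "layout": {
        "job_title": {"header": 0.7, "body": 0.3},
        "location": {"header": 0.2, "body": 0.8},
    },
    "format": {
        "job_title": {"bold": 0.8, "plain": 0.2},
        "location": {"bold": 0.3, "plain": 0.7},
    },
}

UNSEEN = 1e-6  # floor for feature values missing from a table


def log_emission(state, observation):
    """Log-probability of a multi-feature observation given a state.

    Assumes (for this sketch) that features are conditionally independent
    given the state, so the joint emission probability is the product of
    the per-feature probabilities.
    """
    total = 0.0
    for feature, value in observation.items():
        p = EMISSIONS[feature][state].get(value, UNSEEN)
        total += math.log(p)
    return total


def best_state(observation):
    """Return the state with the highest emission log-probability."""
    states = EMISSIONS["term"].keys()
    return max(states, key=lambda s: log_emission(s, observation))
```

With these made-up tables, a bold header block containing the term "engineer" scores far higher under the hypothetical `job_title` state than a term-only model would indicate, because layout and formatting evidence reinforce the term evidence.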
Because the emission probability at any time depends not only on the current state but also on the preceding state, this study presents a novel GHMM based on a second-order Markov chain.
Moreover, this thesis adopts a named entity recognition method based on role tagging. The basic idea is to use the improved GHMM to assign role tags to Web text according to the rules of a role table. Strings are then recognized from the resulting role sequences, which realizes named entity recognition. In this way, Web page information is extracted in terms of both structure and semantics.
Based on an analysis of the large volume of job postings on current recruitment websites, we independently developed a prototype system named the GHMM-Based Web Text Information Extraction System (WebIE).
This thesis first introduces the basic concepts of Web information extraction. It then applies the improved GHMM to extract Web information in terms of both page structure and semantics, building on the analysis of Web pages and on role-tagging-based named entity recognition. Finally, experimental results show that the system is efficient, and its shortcomings and future research directions are discussed.
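As a rough illustration of letting the emission probability depend on the preceding state as well as the current one, the sketch below adapts Viterbi decoding so that each emission is conditioned on a (previous state, current state) pair. This is an assumed simplification, not the thesis's second-order GHMM: the function name, parameter layout, and all probability tables are hypothetical.

```python
import math


def viterbi_pair_emission(obs, states, start, trans, emit_first, emit_pair):
    """Most likely state path when emissions depend on (previous, current) state.

    start[s]             : P(first state = s)
    trans[p][s]          : P(s | p)
    emit_first[s][o]     : P(o | s), used only for the first observation
    emit_pair[(p, s)][o] : P(o | previous state p, current state s)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    def step(p, s, o):
        # Log-score contribution of moving from p to s while emitting o.
        return log(trans[p][s]) + log(emit_pair[(p, s)].get(o, 0.0))

    # score[s] = best log-probability of any path ending in state s
    score = {s: log(start[s]) + log(emit_first[s].get(obs[0], 0.0))
             for s in states}
    back = []  # one back-pointer table per observation after the first
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + step(p, s, o))
            new_score[s] = score[best_prev] + step(best_prev, s, o)
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Because the emission term is attached to the transition pair, the usual dynamic-programming recurrence over single states still applies; only the per-step score changes.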
Keywords/Search Tags: Data Mining, Information Extraction, Generalized Hidden Markov Model (GHMM), Named Entity Identification