Research and Implementation of DOM-tree Based Entity Extraction for Deep Web

Posted on: 2009-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: D Li
Full Text: PDF
GTID: 2198360308478568
Subject: Computer technology
Abstract/Summary:
The spread of the Internet has caused an exponential increase in the amount of searchable information on the Web. The Deep Web usually refers to the part of the Web whose information is stored in Web databases and cannot be reached by following hyperlinks, only through dynamically generated pages. Some statistics indicate that both the scale of information and the extent of access for the Deep Web are greater than those of the Surface Web. Therefore, as the number of Web databases increases, accessing the Deep Web is becoming the main way to acquire information. Query results returned by the Deep Web are mostly presented as HTML pages, which are heterogeneous and unstructured, so extracting the valuable data from these pages is an important problem. The goal of entity extraction for the Deep Web is to accurately extract the entities from result pages and present their information in a structured way.

In this thesis, by analyzing the characteristics of result pages, a DOM-tree based Deep Web Entity Extraction Mechanism (D-EEM) is presented to address entity extraction for the Deep Web. The work includes the following major aspects:

(1) By combining the demands of manual and automatic entity extraction, a hierarchical model of D-EEM is presented, consisting of an information collection level, an entity extraction level and an external representation level. With this model, D-EEM addresses region location, rule generation and semantic annotation.

(2) An automatic entity extraction strategy is presented that determines data regions and entity regions in turn. It improves extraction accuracy by taking into account both the textual content and the hierarchical structure of DOM-trees. In addition, a semantic annotation method is proposed that assigns semantics to the extracted results, based on the Web context and on the co-occurrence of the extracted results with global schemas.

(3) A prototype system of D-EEM is designed and implemented. On the one hand, a graphical user interface lets users set the extraction template manually. On the other hand, a DOM-tree based entity extraction strategy is implemented to extract entities automatically.

(4) An experimental study is conducted to evaluate the feasibility and effectiveness of the key techniques of D-EEM. Compared with various other entity extraction strategies, our approach is superior in accuracy and efficiency.
Keywords/Search Tags:entity extraction, DOM-tree, Deep Web, data region location, entity region location
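
The abstract does not give the details of how D-EEM locates data regions and entity regions, so the following is only an illustrative sketch of the general idea, not the thesis's algorithm. It assumes a simple heuristic: the data region is the DOM node whose children repeat the same tag structure most often while carrying the most text, and each repeated child is treated as one entity region. All names (`Node`, `parse_html`, `find_data_region`, `extract_entities`) and the scoring rule are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag name, child nodes and direct text fragments."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []


class TreeBuilder(HTMLParser):
    """Builds a simple DOM tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.texts.append(text)


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Structural signature: the tag plus the signatures of all children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def collect_text(node):
    """Text fragments of a subtree (own text first, then the children's)."""
    out = list(node.texts)
    for child in node.children:
        out.extend(collect_text(child))
    return out


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_data_region(root):
    """Assumed heuristic: score = repeat count * text length of repeated children."""
    best, best_score = None, 0
    for node in iter_nodes(root):
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, repeats = counts.most_common(1)[0]
        if repeats < 2:
            continue
        text_len = sum(
            len("".join(collect_text(c)))
            for c in node.children
            if signature(c) == sig
        )
        if repeats * text_len > best_score:
            best, best_score = node, repeats * text_len
    return best


def extract_entities(region):
    """Treat each child with the dominant signature as one entity region."""
    counts = Counter(signature(c) for c in region.children)
    sig, _ = counts.most_common(1)[0]
    return [collect_text(c) for c in region.children if signature(c) == sig]
```

On a result page whose navigation bar holds two short links and whose result list holds several `<li>` items with the same inner structure, this heuristic picks the list as the data region because the repeated items carry more text. Real result pages are much noisier, which is presumably why the thesis combines textual content, DOM hierarchy and semantic annotation rather than relying on a single score.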