Research And Implementation Of DOM-Tree Based Entity Extraction For Deep Web

Posted on:2009-03-28

Degree:Master

Type:Thesis

Country:China

Candidate:D Li

Full Text:PDF

GTID:2198360308478568

Subject:Computer technology

Abstract/Summary:

The wide spread of the Internet has caused an exponential increase in the amount of searchable information on the Web. The Deep Web usually refers to the part of the Web whose information is stored in Web databases and cannot be reached by following hyperlinks, only through dynamically generated pages. Some statistics indicate that the Deep Web exceeds the Surface Web in both the scale of its information and the extent to which it is accessed. Therefore, as the number of Web databases grows, accessing the Deep Web is becoming the main way to acquire information. Query results returned by the Deep Web are mostly presented as HTML pages, which are heterogeneous and unstructured, so extracting the valuable data from these pages is an important problem. The goal of entity extraction for the Deep Web is to extract entities accurately from result pages and present their information in a structured way.

In this thesis, by analyzing the characteristics of result pages, a DOM-tree based Deep Web Entity Extraction Mechanism (D-EEM) is presented to solve the problems of entity extraction for the Deep Web. Our work includes the following major aspects:

(1) By combining the demands of both manual and automatic entity extraction, the hierarchical model of D-EEM is presented, which consists of an information collection level, an entity extraction level and an external representation level. With this model, D-EEM addresses the problems of region location, rule generation and semantic annotation.

(2) An automatic entity extraction strategy is presented that determines data regions and entity regions in turn. It improves extraction accuracy by considering both the textual content and the hierarchical structure of DOM-trees. In addition, a semantic annotation method is proposed that assigns semantics to the extracted results, based on the Web context and on the co-occurrence of the extracted results with global schemas.

(3) The prototype system of D-EEM is designed and implemented. On the one hand, a graphical user interface is provided that lets users define extraction templates manually. On the other hand, a DOM-tree based entity extraction strategy is implemented to extract entities automatically.

(4) An experimental study is conducted to evaluate the feasibility and effectiveness of the key techniques of D-EEM. Compared with several other entity extraction strategies, our approach achieves better accuracy and efficiency.
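The abstract does not give the details of D-EEM's region location, so the following Python sketch is only an illustration of the general idea of combining DOM structure with textual content, not the thesis's algorithm. It assumes well-formed HTML and makes up its own heuristic: the data region is the node whose children repeat the same tag structure and carry the most text, and each repeated child is treated as one entity region. All names (`Node`, `TreeBuilder`, `locate_data_region`, `extract_entities`, `SAMPLE_HTML`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node: tag name, children, and directly contained text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified tree; assumes the input HTML is well-formed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.text += stripped + " "


def signature(node):
    """Structural fingerprint of a subtree: tag names only, text ignored."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def full_text(node):
    """All text in a subtree, in document order."""
    parts = [node.text.strip()] + [full_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def locate_data_region(root, min_repeats=2):
    """Assumed heuristic: pick the node whose repeated, same-shaped children hold the most text."""
    best, best_sig, best_score = None, None, 0
    for node in iter_nodes(root):
        counts = Counter(signature(c) for c in node.children if full_text(c))
        if not counts:
            continue
        sig, repeats = counts.most_common(1)[0]
        if repeats < min_repeats:
            continue
        score = sum(len(full_text(c)) for c in node.children if signature(c) == sig)
        if score > best_score:
            best, best_sig, best_score = node, sig, score
    return best, best_sig


def extract_entities(html):
    """Return one list of field strings per entity region found in the page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region, sig = locate_data_region(builder.root)
    if region is None:
        return []
    return [
        [full_text(field) for field in child.children] or [full_text(child)]
        for child in region.children
        if signature(child) == sig
    ]


SAMPLE_HTML = """
<html><body>
<div id="nav"><a>Home</a><a>Help</a></div>
<ul>
<li><b>Book A</b><span>$10</span></li>
<li><b>Book B</b><span>$12</span></li>
<li><b>Book C</b><span>$9</span></li>
</ul>
</body></html>
"""

print(extract_entities(SAMPLE_HTML))
```

On the sample page, the navigation bar also has repeated children, but the list of books carries more text in its repeated items, so the `ul` is chosen as the data region and each `li` becomes one entity. Real result pages are far less regular, which is why a scoring rule this simple should be read as a starting point only.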

Keywords/Search Tags:

entity extraction, DOM-tree, Deep Web, data region location, entity region location

Related items

1	Research And Implementation Of DOM-Tree Based Entity Extraction For Deep Web
2	Real-time Entity Resolution And Query Processing Based On Region-tree Indexing
3	Research On Key Techniques Of Entity Search For Deep Web
4	Non-ferrous Metal Retrieval Key Technology Research Entity In The Field
5	Research On Chinese Named Entity And Entity Relationship Extraction
6	Research On Entity Information Extraction And Recognition On Deep Web
7	Research On The Techniques Of Entity Identity On XML Data
8	Research On LTE Indoor Location Based On Region Division
9	English Entity Answer Extraction And Home Find
10	Research On KNN Query And Join Method In Location Service Based On Entity Density