Research And Implement Of Web Information Extraction Based On XML Elements Processed

Posted on:2010-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:Q X Meng

Full Text:PDF

GTID:2178360275973660

Subject:Computer Science and Technology

Abstract/Summary:

With the continuous development of Internet technology, the Internet has replaced television, radio, newspapers, and other traditional media and become the most important means of acquiring information in daily life. The Web contains a vast number of pages holding a mass of information, and how to obtain the needed information from these pages is a subject of ongoing research. Web information extraction has therefore become an issue of considerable research significance. In this thesis, I study and analyse existing Web information extraction technologies to identify their strengths and weaknesses. Then, by improving and integrating these technologies, I propose a new method, based on processing XML elements, for extracting the main information of Web pages, and I implement it.

The main work of this thesis covers three aspects.

First, I propose a method for pre-processing HTML documents and define the core data structures and functions the system needs. When writing Web information extraction rules, I define the required operations and variables as XML elements and write them into an XML configuration file. The system carries out extraction by loading the XML element processors defined for these elements and executing them as a pipeline.

Second, I propose the concept of the path weight of a DOM node, based on the structure of the document's DOM tree, and design an algorithm that generates the information path by calculating the path weights of the tree's nodes. The algorithm calculates the path weights of the non-leaf nodes of each subtree and, by comparison, selects the nodes with larger path weights. The sequence formed by the selected nodes at each level of a subtree is the information path. I also study the integration of the extracted results with a database.

Finally, I test the performance of the system and analyse the results. Testing has two parts. The first verifies that the system runs faster on pre-processed HTML source documents than on documents that have not been pre-processed, and includes an analysis of the time complexity of extracting information from Web pages. The second tests the system on a variety of data-rich Web pages and evaluates the results with the standard evaluation measures for information extraction. The results show that the proposed method is effective, achieving high precision and recall.
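
The idea of writing extraction rules as XML elements and running one processor per element in a pipeline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the element names (set, strip, extract, filter), the processor functions, and the regular-expression handling of HTML are hypothetical and are not taken from the thesis, whose actual element set and data structures are not given in this abstract.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: each child element names one operation or variable.
CONFIG = """
<rules>
  <set name="min_length" value="4"/>
  <strip tag="script"/>
  <extract tag="p"/>
  <filter/>
</rules>
"""

def do_set(elem, ctx):
    # Store a variable for later processors.
    ctx["vars"][elem.get("name")] = elem.get("value")

def do_strip(elem, ctx):
    # Pre-processing step: drop every <tag>...</tag> block from the HTML.
    tag = re.escape(elem.get("tag"))
    ctx["html"] = re.sub(rf"<{tag}\b.*?</{tag}>", "", ctx["html"],
                         flags=re.S | re.I)

def do_extract(elem, ctx):
    # Collect the inner text of every <tag>...</tag> block.
    tag = re.escape(elem.get("tag"))
    ctx["items"] = re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>", ctx["html"],
                              flags=re.S | re.I)

def do_filter(elem, ctx):
    # Keep only items at least min_length characters long after trimming.
    n = int(ctx["vars"].get("min_length", "0"))
    ctx["items"] = [s.strip() for s in ctx["items"] if len(s.strip()) >= n]

# One processor per XML element name (assumed mapping).
PROCESSORS = {"set": do_set, "strip": do_strip,
              "extract": do_extract, "filter": do_filter}

def run_pipeline(config_xml, html):
    # Run the processors in document order, sharing one context dictionary.
    ctx = {"html": html, "vars": {}, "items": []}
    for elem in ET.fromstring(config_xml):
        PROCESSORS[elem.tag](elem, ctx)
    return ctx["items"]
```

In this sketch, adding a new operation means registering one more function in PROCESSORS and using its element name in the configuration file, which is one plausible reading of why a configurable pipeline is attractive. A real system would parse the page into a DOM tree rather than rely on regular expressions.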

Keywords/Search Tags:

Web Information Extraction, XML Element, DOM, Path Weight, Information Integration

Related items

1	Research On Key Issues Of Web Information Integration Oriented Web Information Extraction
2	Research On Web Information Extraction For Domain In Information Integration System
3	Research Of Web Information Extraction Based On XML
4	Research On Web Information Extraction In Information Integration
5	Research On Key Technologies Of Multi Source Information Integration For Joint Operations
6	Automatic Extraction And Integration For Academic Achievement
7	Web Information Extraction And Integration Research Based On XML
8	Study On Information Extraction Technology In Web Pages Of Review
9	Research Of Information Describing Approach In The Heterogeneous Information Integration
10	Book Website Information Integration System Construction