Research and Implementation of Web Information Extraction Based on XML Element Processing

Posted on: 2010-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q X Meng
Full Text: PDF
GTID: 2178360275973660
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of Internet technology, the Internet has replaced television, radio, newspapers, and other traditional media as the most important means of acquiring information in daily life. The Internet hosts tens of thousands of Web pages carrying a mass of information, and how to obtain the needed information from these pages is a subject of continuing research. Web information extraction has therefore become an issue of important research significance. In this paper, I research and analyse existing Web information extraction technologies to identify their advantages and weaknesses. Then, by improving and integrating existing technologies, I propose a new method, based on processing XML elements, for extracting the main information of Web pages, and I implement the method.
The main work of this paper covers the following three aspects.
First, I propose a method of pre-processing HTML documents and define the core data structures and functions that the system needs. When writing the rules for Web information extraction, I define the needed operations and variables as XML elements and write them into an XML configuration file. The system carries out Web information extraction by loading the various defined XML element processors and executing them in a pipeline.
Second, I propose the concept of path weights of DOM nodes, based on the structure of the document's DOM tree. I then design an algorithm that generates the information path by calculating the path weights of the DOM tree's nodes. The algorithm calculates the path weights of the non-leaf nodes of each subtree and, by comparison, selects the nodes with the larger path weights. The sequence formed by these nodes at each level of a subtree is the information path. I also research the integration of the extracted results with a database.
Finally, I test the performance of the system and analyse the results.
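The configuration-driven pipeline described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the element names (`var`, `after`, `before`, `strip`), the `RULES` file, and the `run_pipeline` function are all hypothetical, chosen only to show operations and variables declared as XML elements and executed in order by registered element processors.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: each child element is one variable or operation.
# The element names are assumptions made for this sketch only.
RULES = """
<rules>
  <var name="start" value="Price:"/>
  <var name="end" value="&lt;/li&gt;"/>
  <after var="start"/>
  <before var="end"/>
  <strip/>
</rules>
"""

def process_var(elem, ctx):
    # Store a named variable for later elements to use.
    ctx["vars"][elem.get("name")] = elem.get("value")

def process_after(elem, ctx):
    # Keep only the text that follows the first occurrence of the marker.
    marker = ctx["vars"][elem.get("var")]
    ctx["data"] = ctx["data"].partition(marker)[2]

def process_before(elem, ctx):
    # Keep only the text that precedes the first occurrence of the marker.
    marker = ctx["vars"][elem.get("var")]
    ctx["data"] = ctx["data"].partition(marker)[0]

def process_strip(elem, ctx):
    # Remove leading and trailing whitespace.
    ctx["data"] = ctx["data"].strip()

# One processor per element name; an unknown element raises KeyError.
PROCESSORS = {
    "var": process_var,
    "after": process_after,
    "before": process_before,
    "strip": process_strip,
}

def run_pipeline(rules_xml, page_text):
    """Run each configured element's processor in document order."""
    ctx = {"data": page_text, "vars": {}}
    for elem in ET.fromstring(rules_xml):
        PROCESSORS[elem.tag](elem, ctx)
    return ctx["data"]

print(run_pipeline(RULES, "<li>Price: 42 USD</li>"))
```

In a design of this kind, changing what is extracted means editing the XML rule file rather than the code, and a new operation is added by registering one more processor.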
Testing is divided into two parts. On the one hand, I verify that the system's execution time is lower when the source HTML document is pre-processed than when it is not, and I analyse the time complexity of information extraction from Web pages. On the other hand, I test the system on various data-rich Web pages and analyse the results using the standard evaluation measures for information extraction. The results show that the proposed method is effective, achieving high precision and recall.
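For reference, precision and recall, the evaluation measures named above, can be computed as in the following sketch. The function name and the sample item labels are illustrative assumptions; the abstract does not report the thesis's actual test data or scores.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items.

    precision = correctly extracted items / all extracted items
    recall    = correctly extracted items / all relevant items
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative data only: 4 items extracted, 5 items actually relevant,
# 3 of the extracted items are correct.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "c", "e", "f"])
print(p, r)
```

Precision penalizes extracting items that are not part of the page's main information, while recall penalizes missing items that are, so the two are usually reported together.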
Keywords/Search Tags: Web Information Extraction, XML Element, DOM, Path Weight, Information Integration