Research And Application Of Automatic Data Extraction From Template-generated Web Pages

Posted on: 2010-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: F Q Lu
GTID: 2178360272991580
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a huge information base, and a great number of web information extraction techniques have emerged to make effective use of it. At present, many web pages are generated dynamically: in response to a user request, the site selects data from a back-end database and embeds it in a common template, as with product description pages on e-commerce sites. Classical methods for extracting the embedded data from such template-generated pages include RoadRunner and EXALG. The time complexity of the RoadRunner algorithm grows exponentially, which limits its practicality. EXALG improves on RoadRunner, but it still does not take into account visual layout information, string similarity, tag attributes, and so on. To address these problems, this thesis gives a formal description of the template detection problem, analyzes the underlying structure of template-generated pages, and studies a new template detection method. The detected templates are then used to extract data from instance pages automatically. Finally, the extraction algorithm is applied to an e-commerce website, where both the list information and the detail information of commodities can be extracted automatically. Compared with other existing methods, the approach is applicable to both "list pages" and "detail pages", and experimental results show that it achieves a large improvement in the recall and precision of data extraction.

The main content and structure of this thesis are as follows. First, the current state of development and the related technologies of data extraction from template-generated web pages are introduced, and the purpose and work of this thesis are set out. Second, the most popular web page data extraction techniques are introduced, and the strengths and weaknesses of the widely used and classical web data extraction techniques are analyzed systematically. In view of these weaknesses, a new and effective data extraction method for template-generated web pages, together with an algorithm that realizes it, is studied. Given a set of similar web pages as input, the method extracts the valid data from those pages automatically. Third, the thesis focuses on the design and implementation of the automatic data extraction algorithm for template-generated web pages. The algorithm first parses the cleaned HTML pages into a token tree and a token sequence. Next, because most pages contain navigation, advertising, site version information, and other content that is irrelevant to information extraction, a new token matching algorithm is studied to remove this irrelevant or redundant information. The tokens of the HTML pages are then classified by computing the Ctokens, which is the core step of the automatic data extraction algorithm; the page templates are constructed, and the valid data is extracted automatically at the field level using the generated Ctokens. Finally, a prototype system was built according to the method and algorithm studied in this thesis to extract data automatically from template-generated web pages. The system can effectively extract data from certain pages of e-commerce websites, such as the "list pages" and "detail pages" of commodities. The recall and precision of the data extraction process are greatly improved, and the work meets a widespread practical need and has value for wider application.
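The pipeline outlined in the abstract (tokenize similar pages, separate template tokens from data, then extract the data) can be illustrated with a minimal Python sketch. It is not the thesis's Ctokens algorithm, which the abstract does not define. It assumes a much simpler occurrence-based heuristic: a token that appears the same number of times in every input page is treated as part of the template, and the remaining text tokens are treated as data. The names Tokenizer, tokenize, template_tokens, extract_data, and the two sample pages are hypothetical and exist only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten an HTML page into a sequence of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch; only the tag name is kept.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.tokens.append(("text", text))


def tokenize(html):
    """Return the token sequence of one page."""
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def template_tokens(pages):
    """Assumed heuristic: tokens with the same occurrence count in every page."""
    counts = [Counter(tokenize(page)) for page in pages]
    first = counts[0]
    return {tok for tok, n in first.items() if all(c[tok] == n for c in counts)}


def extract_data(page, template):
    """Text tokens of a page that are not part of the inferred template."""
    return [
        value
        for kind, value in tokenize(page)
        if kind == "text" and (kind, value) not in template
    ]


# Two hypothetical "detail pages" generated from the same template.
page_a = ("<html><body><h1>Shop</h1><div>Price:</div>"
          "<span>Red Kettle</span><span>19.99</span></body></html>")
page_b = ("<html><body><h1>Shop</h1><div>Price:</div>"
          "<span>Blue Teapot</span><span>24.50</span></body></html>")

template = template_tokens([page_a, page_b])
data_a = extract_data(page_a, template)  # ['Red Kettle', '19.99']
data_b = extract_data(page_b, template)  # ['Blue Teapot', '24.50']
```

A heuristic this simple misclassifies any data value that happens to be identical on every input page, and it ignores the visual layout, string similarity, and tag attribute information that the abstract names as missing from earlier methods.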
Keywords/Search Tags: Web Information Extraction Technology, Page Template, Token Tree Matching Algorithm, Ctokens