Research On Automatic Web Information Extraction Technique

Posted on: 2009-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: C B Lai
Full Text: PDF
GTID: 2178360242482980
Subject: Software engineering
Abstract/Summary:
The Web has become a large and complex information warehouse, and extracting information from it automatically and rapidly by program is becoming more and more important. One important category of web pages is the dynamic pages of data-providing websites, for example the commodity detail pages of e-commerce websites. Such pages are usually numerous and rich in content, which makes extraction valuable. Unlike news pages, they are also highly structured, containing little free text and a great deal of fixed text. In this thesis, based on the characteristics of these pages, we propose a set of algorithms for page clustering, template generation, data extraction and data labeling. We also develop an automatic information extraction system based on these algorithms that extracts structured data from web pages and can be used in many applications.

The foundation of our algorithms is the restricted top-down tree edit distance algorithm. Three methods are built on this edit distance: an improved Clustering Using Representatives (CURE) page clustering method; a template generation method based on threshold pruning, in which template nodes include prefix and postfix text, which remarkably increases extraction precision; and a method for automatically annotating the extracted data fields. Together, these methods make the whole process automatic.

Experimental results on a series of data-providing websites, and a comparison with several other web data extraction algorithms, show that the extraction technique of this thesis achieves high accuracy.
Keywords/Search Tags: Web data extraction, tree edit distance, template detection, page clustering, data labeling
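
The abstract names a restricted top-down tree edit distance as the basis for clustering, template generation and labeling, but does not give its definition. As a rough illustration only, the sketch below computes a top-down edit distance in the style of Selkow's algorithm, where roots are always matched and whole subtrees are inserted or deleted. The `Node` class, the unit costs and the function names are assumptions made for this example, and the restricted variant used in the thesis may differ in its mapping rules and costs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered, labeled tree (hypothetical stand-in for a DOM node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance with unit costs (an assumption).

    The two roots are always mapped to each other. Children are aligned
    with a sequence edit distance in which deleting or inserting a child
    costs the size of its whole subtree, and matching two children costs
    their recursive distance.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j]: cost of aligning the first i children of a with the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy page trees that differ by one extra leaf under the first <div>.
page1 = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("table", [Node("tr")]),
])])
page2 = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p"), Node("span")]),
    Node("table", [Node("tr")]),
])])

print(top_down_distance(page1, page2))
```

A distance of this kind could serve as the dissimilarity measure when grouping pages that share a template: pages generated from the same template differ only in a few subtrees and so have small distances relative to their size.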