
Research on Information Extraction Based on a Wrapper Model Algorithm

Posted on: 2010-09-16  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Li  Full Text: PDF
GTID: 2178360275988908  Subject: Computer software and theory
Abstract/Summary:
Information extraction from websites is nowadays a relevant problem. We can consider the Web as the largest "knowledge base" ever developed and made available to the public. However, HTML sites are in some sense modern legacy systems, since such a large body of data cannot be easily accessed and manipulated. The reason is that Web data sources are intended to be browsed by humans, not computed over by applications. XML, which was introduced to overcome some of the limitations of HTML, has so far been of little help in this respect. As a consequence, extracting data from Web pages and making it available to computer applications remains a complex and relevant task.

We present a novel approach to information extraction from websites. Before extraction, noise is removed from the HTML page; the de-noised page, represented as a page tree, then becomes the subject of information extraction.

The main contributions are two algorithms. The first is a vision-based noise-removal algorithm for information extraction. Starting from an analysis of the noise found in Web pages, it removes noise from the content before extraction, combining visual information with DOM tree matching, so that extraction operates on a visually de-noised DOM tree and becomes more efficient. The second is a DOM tree matching algorithm together with a page-based wrapper tree generation algorithm. Examples show how mismatches in the selection pattern and the iteration pattern of pages are detected and resolved, and the page tree matching algorithm is evaluated through experiments. Mismatches are resolved step by step: string mismatches and selection mismatches are handled first, yielding a preliminary template; iteration mismatches are then handled to produce the final template.
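The de-noising step can be illustrated with a minimal Python sketch. The abstract does not give the thesis's visual criteria, so the rules below (a set of noise tags and a link-density threshold for block elements) are assumed stand-ins, and all names (`NOISE_TAGS`, `BLOCK_TAGS`, `LINK_RATIO_THRESHOLD`, `denoise`) are hypothetical. The sketch also assumes well-formed markup, since it parses with `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Assumed noise rules; the thesis's actual vision-based criteria are not stated.
NOISE_TAGS = {"script", "style", "nav", "footer", "iframe"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "td"}
LINK_RATIO_THRESHOLD = 0.8  # assumed cut-off for "mostly links" blocks


def text_length(node):
    """Number of text characters under node, ignoring surrounding whitespace."""
    return sum(len(piece.strip()) for piece in node.itertext())


def link_text_length(node):
    """Number of text characters under node that sit inside <a> elements."""
    return sum(text_length(a) for a in node.iter("a"))


def is_noise(node):
    """Flag known noise tags, and block elements whose text is mostly links."""
    if node.tag in NOISE_TAGS:
        return True
    total = text_length(node)
    if node.tag not in BLOCK_TAGS or total == 0:
        return False
    return link_text_length(node) / total > LINK_RATIO_THRESHOLD


def denoise(node):
    """Remove noisy subtrees in place and return the pruned tree.

    Note: removing an element also drops its tail text (text that follows
    the element inside the parent); a fuller version would preserve it.
    """
    for child in list(node):
        if is_noise(child):
            node.remove(child)
        else:
            denoise(child)
    return node
```

Calling `denoise(ET.fromstring(page_source))` on a page with a navigation `div` made up only of links and a content `div` with ordinary text would keep the content block and drop the navigation block and any `script` elements.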
The study resolves these mismatch problems in an orderly manner and yields a wrapper tree generation algorithm that extracts data automatically, without human intervention.
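The first stage of mismatch handling (string and selection mismatches) can be sketched in Python as follows. This is not the thesis's matching algorithm: as a stand-in, it aligns the token sequences of two sample pages with the standard library's `difflib.SequenceMatcher`, turns differing text in the same position into a data field, and marks content found in only one page as optional. Iteration mismatches (repeated blocks) are left out. The names `FIELD`, `tokenize`, and `induce_template` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

FIELD = "#TEXT"  # hypothetical placeholder for an extracted data field


class Tokenizer(HTMLParser):
    """Flatten a page into (kind, value) tokens: tags and non-empty text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip()))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def induce_template(page_a, page_b):
    """Build a preliminary template from two pages of the same kind."""
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        left, right = a[a0:a1], b[b0:b1]
        if op == "equal":
            template.extend(value for _, value in left)
            continue
        # String mismatch: differing text in the same position becomes a field.
        while left and right and left[0][0] == "text" and right[0][0] == "text":
            template.append(FIELD)
            left, right = left[1:], right[1:]
        # Selection mismatch: what remains on either side is marked optional.
        for rest in (left, right):
            if rest:
                template.append("(" + " ".join(v for _, v in rest) + ")?")
    return template


page_a = "<html><body><h1>Books</h1><b>Title:</b> Dune <i>new</i></body></html>"
page_b = "<html><body><h1>Books</h1><b>Title:</b> Emma</body></html>"
template = induce_template(page_a, page_b)
```

For these two pages, the book title becomes a `#TEXT` field and the `<i>new</i>` fragment, present only in the first page, becomes an optional group, while the shared markup and labels stay as constants. A full wrapper generator would additionally need to recognize repeated blocks and generalize them into iterators.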
Keywords/Search Tags: Information Extraction, Wrapper, DOM tree, Match Technology