Research For Information Extraction Based On Wrapper Model Algorithm

Posted on:2010-09-16

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Li

Full Text:PDF

GTID:2178360275988908

Subject:Computer software and theory

Abstract/Summary:

Information extraction from websites is an important problem today. The Web can be regarded as the largest "knowledge base" ever developed and made available to the public. However, HTML sites are in some sense modern legacy systems, since such a large body of data cannot be easily accessed and manipulated. The reason is that Web data sources are intended to be browsed by humans, not computed over by applications. XML, which was introduced to overcome some of the limitations of HTML, has so far been of little help in this respect. As a consequence, extracting data from Web pages and making it available to computer applications remains a complex and relevant task.

We present a novel approach to information extraction from websites: noise is first removed from the HTML page, and information is then extracted from the page tree of the de-noised page.

The main contributions are two algorithms. The first is a vision-based noise-removal algorithm for information extraction. It analyzes the noise in a Web page and removes it before any content is extracted, combining visual information with DOM tree matching so that extraction operates on a visually de-noised DOM tree, which improves extraction efficiency. The second is a DOM tree matching algorithm together with a page-based wrapper tree generation algorithm. Examples show how mismatches in the optional (selection) patterns and iterative patterns of pages are detected and resolved, and the page tree matching algorithm is examined through experiments. Matching proceeds step by step: string mismatches and optional mismatches are handled first to generate a preliminary template, and iterative mismatches are then handled to generate the final template.

By studying how these mismatches are resolved, the thesis derives, in an orderly manner, a wrapper tree generation algorithm that extracts data automatically, without human intervention.
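
The two-stage idea described above (de-noise the page, then generalize over mismatches between pages) can be illustrated with a minimal sketch. It is not the thesis's algorithm. The tag-name list `NOISE_TAGS` is an assumed stand-in for the vision-based de-noising, and a flat token alignment with Python's `difflib.SequenceMatcher` is an assumed stand-in for DOM tree matching. The names `tokenize`, `generalize`, `PAGE_A` and `PAGE_B` are hypothetical. Only string and optional mismatches are handled; iterative mismatches (repeated records) are left out.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: these tags are treated as noise. The thesis uses visual cues instead.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


class _Tokenizer(HTMLParser):
    """Flattens a page into tag and text tokens, skipping noise subtrees."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.noise_depth = 0  # > 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif self.noise_depth == 0:
            self.tokens.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif self.noise_depth == 0:
            self.tokens.append("</%s>" % tag)

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.tokens.append("text:" + text)


def tokenize(html):
    """Return the de-noised token sequence of one page."""
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def _is_text(token):
    return token.startswith("text:")


def generalize(tokens_a, tokens_b):
    """Align two token sequences and turn their mismatches into a template."""
    template = []
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = tokens_a[i1:i2], tokens_b[j1:j2]
        if op == "equal":
            template.extend(left)
        elif (op == "replace" and len(left) == len(right)
              and all(_is_text(t) for t in left + right)):
            # String mismatch: the differing text becomes a data field.
            template.extend(["#FIELD"] * len(left))
        else:
            # Optional mismatch: tokens seen on only one page become optional.
            for side in (left, right):
                if side:
                    template.append("(" + " ".join(side) + ")?")
    return template


PAGE_A = ("<html><body><nav>Home</nav><h1>Book</h1>"
          "<p>Title: <b>Dune</b></p><script>x=1</script></body></html>")
PAGE_B = ("<html><body><nav>Home | About</nav><h1>Book</h1><h2>Sale</h2>"
          "<p>Title: <b>Emma</b></p></body></html>")

template = generalize(tokenize(PAGE_A), tokenize(PAGE_B))
print(template)
```

In this sketch the navigation bar and script are dropped before alignment, the differing book titles collapse into a single `#FIELD`, and the `<h2>` block that appears on only one page is marked optional. A tree-based matcher, as described in the abstract, would additionally respect element nesting, which a flat alignment does not.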

Keywords/Search Tags:

Information Extraction, Wrapper, DOM tree, Match Technology

Related items

1	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
2	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
3	Research On Automatic And Efficient Technologies For Web Information Extraction
4	Web Information Extraction Technology Applied Research, Competitive Intelligence Platform In The Enterprise
5	Research And Implementation Of Automatic Information Extraction From Web Pages
6	Web Page Attribute Extraction Method Research
7	Algorithm Research For Text Information Extraction Based On Wrapper Model
8	A Web News Extraction Method Based On Filtering Noise Wrapper
9	Research Of Data Extraction Technology Based On Tag Tree From List Pages
10	Research On Web Information Extraction Based On Script Code And Local Data Matching