Research On Automatic Web Information Extraction Technique

Posted on:2009-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:C B Lai

GTID:2178360242482980

Subject:Software engineering

Abstract/Summary:

The Web has become a large and complex information warehouse, and extracting information from it automatically and rapidly by program is becoming increasingly important. One important category of web pages is dynamic pages from data-providing websites, for example the commodity detail pages of e-commerce websites. Such pages are usually numerous and rich in content, which makes extraction valuable; they are also highly structured, containing little free text and much fixed text, which distinguishes them from news pages. In this thesis, based on these characteristics, we propose a set of algorithms for page clustering, template generation, data extraction, and data labeling. We also develop an automatic information extraction system based on these algorithms that extracts structured data from web pages for use in many applications.

The foundation of our algorithms is the restricted top-down tree edit distance algorithm. Built on this edit distance are an improved CURE (Clustering Using Representatives) page clustering method; a template generation method based on threshold pruning, whose template nodes include prefix and postfix text and which remarkably increases extraction precision; and a method for automatically annotating the extracted data fields. Together these methods make the whole process automatic.

Experimental results on a series of data-providing websites, and a comparison with several other web data extraction algorithms, show that the extraction technique of this thesis achieves high accuracy.
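To make the idea of a top-down tree edit distance concrete, the following is a minimal sketch in Python. It is not the thesis's restricted algorithm, whose details the abstract does not give; it is a Selkow-style top-down distance with unit costs, in which a node can only be inserted or deleted together with its whole subtree. The `Node` type, the function names, and the unit costs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Selkow-style top-down edit distance (assumed unit costs).

    Roots are always mapped to each other; a child subtree is either
    matched recursively, deleted whole, or inserted whole.
    """
    # Relabeling the root costs 1 when the labels differ.
    root_cost = 0 if a.label == b.label else 1

    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of aligning the first i children of a
    # with the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])  # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])  # insert subtree

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(
                d[i - 1][j] + size(ca),                        # delete ca
                d[i][j - 1] + size(cb),                        # insert cb
                d[i - 1][j - 1] + top_down_distance(ca, cb),   # match
            )
    return root_cost + d[m][n]


# Two toy pages that differ by one leaf element.
page1 = Node("div", [Node("h1"), Node("p"), Node("span")])
page2 = Node("div", [Node("h1"), Node("span")])
print(top_down_distance(page1, page2))  # 1: the <p> leaf is deleted
```

A distance of this kind could then feed the later stages named in the abstract, for example as the dissimilarity measure when clustering pages, though how the thesis combines them is not stated here.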

Keywords/Search Tags:

Web data extraction, tree edit distance, template detection, page clustering, data labeling

Related items

1	Web Information Extracting Based On Tree Edit Distance
2	The Research Of Semi-structured Web Pages Information Extraction
3	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
4	Research On Web-based Full-station Data Information Extraction Based On Template
5	Workflow Application Of Clustering Tree Edit Distance
6	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
7	Between The Different Types Of Data Clustering Algorithm
8	Design And Implementation Of Mass Webpage Labeling Analysis System
9	Research On Web Data Extraction Based On Web Page Structure
10	Research Of Data Extraction Technology Based On Tag Tree From List Pages