Research On Web Information Extraction Based On Page Structure Clustering

Posted on:2014-02-03

Degree:Master

Type:Thesis

Country:China

Candidate:H W Liao

GTID:2248330398974098

Subject:Signal and Information Processing

Abstract/Summary:

The Web has become the world's largest and most comprehensive information repository. Obtaining valuable information from it automatically and rapidly is becoming increasingly important. Most Web pages are written in HTML, so the rendered pages are mostly structured or semi-structured. Sites typically generate their pages dynamically from data, so pages built from the same template present their topical information in similar ways. Exploiting these features, this thesis proposes a Web information extraction method based on clustering pages by their structure, and designs a prototype system based on this method. The system classifies Web pages by structural similarity, generates extraction rules for similar pages easily and quickly, and uses the generated rules to extract page information accurately. The system is divided into three modules: (1) a Web page download module, which implements an efficient Web crawler for collecting pages; (2) a rule learning module, which performs Web page clustering; (3) an information extraction module, which performs Web information extraction.

This thesis first studies the structure of Web pages and represents each page as a tree using the DOM model. Page structure similarity algorithms are analyzed on this representation; an improved algorithm is proposed and compared with the tree edit distance algorithm and the tree path matching algorithm. A hierarchical clustering algorithm that uses the similarity measure then finds similar pages. Next, Web crawler technology and Web preprocessing technology, including the Web DOM model, page cleaning, and graphical display of page structure, are studied. Finally, this thesis studies the representation of extraction rules.

Experimental results on multiple Web sites show that the method extracts data records from similar Web pages with high accuracy.
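
The clustering step described above can be illustrated with a minimal sketch. It is not the thesis's improved similarity algorithm, which the abstract does not specify. As a stand-in, it uses a simple form of tree path matching: each page is reduced to its set of root-to-node tag paths, and two pages are compared with the Jaccard index of those sets. The names `tag_paths`, `path_similarity`, and `cluster_pages`, the single-linkage merge rule, and the threshold of 0.6 are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(a, b):
    """Jaccard index of two tag-path sets (assumed stand-in for the thesis's measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Single-linkage agglomerative clustering of pages by structural similarity.

    Repeatedly merges the two most similar clusters until no pair reaches
    the (arbitrarily chosen) threshold. Returns lists of page indices.
    """
    paths = [tag_paths(page) for page in pages]
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        return max(path_similarity(paths[i], paths[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, x, y = max(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Calling `cluster_pages` on a list of HTML strings returns groups of page indices; pages generated from the same template share most of their tag paths and therefore tend to fall into the same group, after which one set of extraction rules could be learned per group. Path sets ignore sibling order and repetition counts, which a tree edit distance would take into account.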

Keywords/Search Tags:

information extraction, DOM (Document Object Model), Web structure similarity, Web page clustering

Related items

1	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
2	Research Of Web Page Purifying Method Based On Document Object Model
3	Research On Mining Structure Of WEB Page For Information Extraction
4	Research On Web Article Automatic Extraction Method Based On Page Segmentation
5	Research On Web Information Extraction Based On Clustering Algorithm
6	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
7	Study On Information Autonomous Extraction Technology Of Web Pages
8	The Adaptive Web Information Extraction Based On Single DOM Tree Characteristics And Classification
9	Research On Web-based Full-station Data Information Extraction Based On Template
10	Web Object Extraction Retrieval System Design And Implementation