Research On Mining Structure Of WEB Page For Information Extraction

Posted on:2011-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:J Liu

GTID:2178330338481049

Subject:Computer Science and Technology

Abstract/Summary:

Information extraction is an important technology for extracting valuable information and knowledge from massive collections of Web pages, and Web page structure mining is a key step in it. However, most existing page structure mining algorithms rely on heuristic rules or manual labeling, so neither their efficiency nor their scalability meets the requirements of practical applications that must handle massive, heterogeneous Web pages. Therefore, information extraction applications urgently need more intelligent and automated page structure mining techniques.

Against this background, we analyze and study two key technologies for Web page structure mining: page clustering and page segmentation. We found that traditional methods use tags in a highly heuristic way. To address this, we propose a tag vector based on statistical information, which provides a solid technical foundation for the page clustering and page segmentation algorithms in this thesis. Our main contributions are as follows:

1. Matrix structure based page clustering algorithm (MSPC). MSPC treats all pages as matrices of the same size, and its computational complexity is determined only by the sorting algorithm it uses. We have not only proved in theory that MSPC is a fast algorithm, but also demonstrated that it clusters Web pages more effectively than traditional clustering algorithms of the same time complexity.

2. Graph and Statistic Based Page Segment (GSPS). GSPS dispenses with supervised and semi-supervised techniques such as heuristics and manual labeling. It combines tag statistics with a graph partitioning algorithm (the GN algorithm) to obtain a non-heuristic, unsupervised page segmentation algorithm. Experimental results show that GSPS is generally comparable to VIPS, and that it is more robust and more effective than VIPS when segmenting homogeneous Web pages.

3. Information extraction system (wrapper prototype system). The system performs site-based information extraction. In addition, the page clustering subsystem can be used in information retrieval, and the page segmentation subsystem can be used both in information retrieval and in segmenting Web pages for small mobile devices.
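The abstract does not define the statistical tag vector, so the following is only a minimal sketch of the general idea, under stated assumptions: a page is represented by how often each tag from a fixed vocabulary occurs, and two pages are compared by cosine similarity. The names `TagCounter`, `tag_vector`, `cosine_similarity` and `VOCAB` are hypothetical and do not come from the thesis; the actual MSPC and GSPS algorithms may build and compare their vectors differently.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names are lower-cased by HTMLParser before this call.
        self.counts[tag] += 1


def tag_vector(html, vocabulary):
    """Return tag counts in the order given by `vocabulary` (an assumption,
    not the thesis's definition of the tag vector)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[tag] for tag in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and toy pages, for illustration only.
VOCAB = ["div", "table", "tr", "td", "a", "p"]
list_page = ("<div><table><tr><td><a href='#'>x</a></td></tr>"
             "<tr><td><a href='#'>y</a></td></tr></table></div>")
other_list_page = "<div><table><tr><td><a href='#'>z</a></td></tr></table></div>"
article_page = "<div><p>a</p><p>b</p><p>c</p></div>"

print(tag_vector(list_page, VOCAB))  # [1, 1, 2, 2, 2, 0]
print(cosine_similarity(tag_vector(list_page, VOCAB),
                        tag_vector(other_list_page, VOCAB)))
print(cosine_similarity(tag_vector(list_page, VOCAB),
                        tag_vector(article_page, VOCAB)))
```

In this toy setup the two table-based list pages score much closer to each other than either does to the paragraph-based page, which is the kind of signal a structure-based clustering step could exploit without hand-written rules.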

Keywords/Search Tags:

wrapper, page segmentation, page clustering, DOM tree

Related items

1	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
2	The Research And Implementation On Web Page Segmentation
3	A Web Structure Clustering Algorithm For Mobile Page Adaptive Platform
4	Research Of Web Page Purifying Method Based On Document Object Model
5	Study On Web Data Processing Technology
6	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
7	Research On Webpage Recognition Technology Based On Vision And Semantics
8	The Optimization And Implement Of Enterprise Search Engine
9	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
10	Research On WEB Segment Algorithm Based On Mobile Device