Research On Mining Structure Of WEB Page For Information Extraction

Posted on: 2011-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 2178330338481049
Subject: Computer Science and Technology
Abstract/Summary:
Information extraction is an important technology for extracting valuable information and knowledge from massive collections of Web pages, and Web page structure mining is a key step in it. However, most existing page structure mining algorithms rely on heuristic rules or manual labeling, so either their efficiency or their scalability cannot meet the requirements of practical applications on massive, heterogeneous Web pages. Therefore, the development of information extraction applications urgently requires more intelligent, automated page structure mining technology.

Against this background, we analyze and study two key technologies for Web page structure mining: page clustering and page segmentation. We have found that traditional methods use tags in a very heuristic way. To address this, we propose a tag vector based on statistical information, which provides a solid technical foundation for the page clustering and page segmentation algorithms in this thesis. Our main contributions are as follows:

1. Matrix structure based page clustering algorithm (MSPC). The MSPC algorithm treats all pages as matrices of the same size, and its computational complexity is determined only by the sorting algorithm it uses. We have not only proved in theory that MSPC is a fast algorithm, but also demonstrated that MSPC clusters Web pages more effectively than traditional clustering algorithms of the same time complexity.

2. Graph and Statistic Based Page Segment (GSPS). GSPS avoids supervised and semi-supervised techniques such as heuristics and labeling. It combines tag (label) statistics with a graph segmentation algorithm (the GN algorithm) to form a non-heuristic, unsupervised page segmentation algorithm. Experimental results show that GSPS is generally comparable to VIPS, and that GSPS is more robust and more effective than VIPS when segmenting homogeneous Web pages.

3. Information extraction system (wrapper prototype system). It achieves site-based information extraction. In addition, the page clustering subsystem can be used in information retrieval, and the page segmentation subsystem can be used both in information retrieval and in segmenting Web pages for small mobile devices.
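
The abstract does not define the tag vector, so the following is only an illustrative sketch of the general idea of describing a page by tag statistics rather than by hand-written rules. It assumes the simplest possible statistic (how often each opening tag occurs) and compares pages with cosine similarity. The names TagCounter, tag_vector and cosine_similarity are hypothetical, and this is not the MSPC or GSPS algorithm from the thesis.

```python
from collections import Counter
from html.parser import HTMLParser
import math


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page (assumed statistic)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this method.
        self.counts[tag] += 1


def tag_vector(html):
    """Return a sparse tag-frequency vector for one page (hypothetical helper)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse tag-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two list-style pages and one article-style page, made up for illustration.
list_page = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"
other_list = "<html><body><ul><li>x</li><li>y</li></ul></body></html>"
article = "<html><body><h1>t</h1><p>x</p><p>y</p></body></html>"

sim_lists = cosine_similarity(tag_vector(list_page), tag_vector(other_list))
sim_mixed = cosine_similarity(tag_vector(list_page), tag_vector(article))
print(sim_lists > sim_mixed)
```

Under these assumptions, pages built from the same template score closer to each other than to pages with a different layout, which is the kind of signal a clustering step could group on without any manually written rules.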
Keywords/Search Tags: wrapper, page segmentation, page clustering, DOM tree