
The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique

Posted on: 2011-01-20  Degree: Master  Type: Thesis
Country: China  Candidate: J Dong
GTID: 2178360308990376  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, searching for information online has become an increasingly common activity. However, almost all web pages contain noise that is irrelevant to the main content, such as advertisements and navigation bars. This noise seriously degrades Web mining and search. Web information extraction technology has emerged in response. Among the various web page information extraction methods, page structure analysis has become an active research area because of its strong ability to interpret web pages.
First, the development of information extraction, its working principles, and related technologies are introduced. Then the role of page structure analysis is studied in depth. Because existing information extraction techniques lack semantic features, this work builds a spatial feature vector and a text feature vector for the blocks of a web page and proposes a combined similarity formula to improve the accuracy of web page information extraction.
To address the shortcomings of existing Web information extraction approaches, the method partitions a web page into semantic blocks using visual cues. Twelve features are extracted to form a feature vector, and the method computes the distance between blocks using the similarity formula. Blocks with similar structure and semantics are then clustered, and visual features are used to identify the informative blocks.
Finally, the method is compared with another information extraction method, and the extraction results of both methods are fed to a kNN classifier. Experiments show that the proposed method extracts informative blocks effectively and yields higher classification accuracy.
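The abstract does not give the similarity formula or the twelve features, so the following is only a minimal sketch of the general idea: combine a spatial similarity and a text similarity into one score, then group blocks whose score exceeds a threshold. The `Block` fields, the cosine measures, the weight `alpha`, the threshold, and the greedy clustering rule are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    """A page block with hypothetical features (not the thesis's twelve)."""
    name: str
    spatial: list  # assumed layout features, e.g. normalized x, y, width, height
    text: dict     # assumed term weights, e.g. term frequencies


def cosine_dense(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cosine_sparse(a, b):
    """Cosine similarity of two term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def combined_similarity(p, q, alpha=0.5):
    """Weighted sum of spatial and text similarity (alpha is an assumed weight)."""
    return (alpha * cosine_dense(p.spatial, q.spatial)
            + (1 - alpha) * cosine_sparse(p.text, q.text))


def cluster_blocks(blocks, threshold=0.7, alpha=0.5):
    """Greedy single-pass clustering: a block joins the first cluster whose
    first member is at least `threshold` similar, else starts a new cluster."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if combined_similarity(block, cluster[0], alpha) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters
```

In this sketch, two navigation bars that share position and some vocabulary would fall into one cluster, while a main-content block with different layout and terms would form its own. Choosing which cluster is informative is a separate step that the abstract attributes to visual features.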
Keywords/Search Tags: Page Structure Analysis, Web Page Information Extraction, VIPS, Vector Space Model, Text Clustering