
The Research and Implementation of Content Extraction in Web Pages Based on Page Segmentation

Posted on: 2011-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: M Miao
Full Text: PDF
GTID: 2178360305483079
Subject: Computer application technology
Abstract/Summary:
With the advent of the Internet age, the Web has become the world's largest source of information and has brought great convenience to people's lives. Yet the same abundance makes it hard to use that information effectively: the information on the Internet is rich and varied, but users often cannot find what they need. To make effective use of Web information, researchers continue to study techniques for organizing and exploiting online content. However, Web documents are not as clean as conventional text. They contain a great deal of noise, such as scripts added to enhance user interaction, navigation links added to ease browsing, and advertisement links added for commercial reasons. This noise is usually unrelated to the theme of the page. In Web applications such as page classification and information retrieval, leaving the noise in place not only slows processing but also reduces the accuracy of classification and retrieval. For example, an information retrieval system may return a page to a user merely because an advertisement on the page contains the query keywords. Extracting the theme and theme-related content from Web pages quickly and accurately has therefore become an essential pre-processing step in Web information processing systems.

This thesis makes the following contributions to content extraction:

(1) A theme-based page recognition algorithm is proposed and implemented. It uses multiple features of theme-based pages. Web pages are first filtered by heuristic rules, and pages that the rules cannot decide are then passed to a classifier. The results show that the algorithm achieves good recognition results.

(2) Earlier page segmentation algorithms relied on only one kind of clue. To address this, the thesis proposes and implements a Multi-clues Based Page Segmentation Algorithm, which combines the tag (label) clues, visual clues, and text clues on a page to divide it into blocks. It also builds a semantic block tree that retains information about each semantic block, such as its visual information, for later use. Experiments show that, compared with existing segmentation algorithms, it improves segmentation accuracy, is more robust, and applies to a wider range of pages.

(3) The main features of Web content blocks are summarized, and a theme content block identification algorithm based on combined features is proposed and implemented. It combines a recognition algorithm based on text features with one based on layout features. The text-feature algorithm is biased towards the text content within semantic blocks, while the layout-feature algorithm reflects the semantic structure within the blocks. Theme content blocks identified with the combined features therefore reflect both the importance of the text in the page and the importance of its internal structure. This avoids the bias introduced by relying on a single feature and improves the precision and recall of theme content extraction.

(4) For theme-related content extraction, a relevant-link extraction algorithm and a relevant-image extraction algorithm are implemented using heuristic rules.
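
The abstract does not state which text and layout features item (3) uses or how they are combined, so the following Python sketch only illustrates the general idea of scoring a block on both kinds of feature. Everything in it is an assumption made for illustration: the `Block` fields, the use of link density as the text feature, the use of relative area and horizontal centrality as layout features, the equal weights, and the 0.5 threshold.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical semantic block; these fields are assumptions, not the thesis's."""
    text_chars: int   # characters of visible text in the block
    link_chars: int   # characters of that text that sit inside links
    area: float       # rendered area of the block, in square pixels
    center_x: float   # horizontal center of the block, in pixels


def text_score(block: Block) -> float:
    """Assumed text feature: share of the block's text that is not link text."""
    if block.text_chars == 0:
        return 0.0
    return 1.0 - block.link_chars / block.text_chars


def layout_score(block: Block, page_width: float, page_area: float) -> float:
    """Assumed layout feature: mean of relative area and horizontal centrality."""
    area_ratio = min(block.area / page_area, 1.0)
    half_width = page_width / 2
    centrality = 1.0 - abs(block.center_x - half_width) / half_width
    centrality = max(0.0, min(1.0, centrality))
    return 0.5 * area_ratio + 0.5 * centrality


def combined_score(block: Block, page_width: float, page_area: float,
                   w_text: float = 0.5) -> float:
    """Weighted sum of the two scores; the weight is an illustrative choice."""
    return (w_text * text_score(block)
            + (1.0 - w_text) * layout_score(block, page_width, page_area))


def theme_blocks(blocks, page_width, page_area, threshold=0.5):
    """Keep blocks whose combined score reaches the (assumed) threshold."""
    return [b for b in blocks
            if combined_score(b, page_width, page_area) >= threshold]


# Example: a large, centered, text-heavy block versus a narrow link-heavy sidebar.
article = Block(text_chars=2000, link_chars=100, area=400_000, center_x=500)
nav = Block(text_chars=200, link_chars=190, area=60_000, center_x=100)
kept = theme_blocks([article, nav], page_width=1000, page_area=1_000_000)
```

In this toy example the sidebar scores low on both features, while the article block scores high on both, so only the article block is kept. A block that is text-heavy but small and off to one side would be judged by the weighted combination instead of by either feature alone, which is the kind of single-feature bias the abstract says the combined approach is meant to avoid.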
Keywords/Search Tags: theme-based page, Web page segmentation, content extraction, semantic block