
Research on Content Extraction in HTML Web Pages Based on Multi-Features

Posted on: 2009-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: L X Li
Full Text: PDF
GTID: 2178360245495015
Subject: Computer system architecture
Abstract/Summary:
Web pages often contain rich and varied content, which can be divided into content related to the page's topic and content unrelated to it. Identifying topic-related content for applications such as retrieval and classification can save storage space and substantially improve their performance. This problem has been studied extensively with good results. The most widely researched and applied approach is based on content blocks: a page is treated as an aggregation of separate blocks, from which the blocks carrying the relevant topic content, called topic-content blocks, are identified and extracted. The process of identifying and extracting a web page's topic content is called Web content extraction.

A web page usually consists of the page title, the text or image blocks describing the main content of the page, navigation links, decorative parts, and interaction and contact information. Clearly, the latter items are not closely related to the topic of the page. Web pages are usually classified as Hub pages and Authority pages: a Hub page mainly consists of navigation links to Authority pages, and an Authority page provides the textual description of a topic. Content extraction can therefore be divided into two types. For a Hub page, content extraction means finding the links that point to Authority pages; for an Authority page, it means recognizing the text blocks that describe the topic of the page.

Our study focuses on the following aspects:

First, this paper introduces the types of Web pages and analyzes several effective algorithms for classifying Web pages by type, on the basis of which we propose an improved classification method. The improved algorithm is divided into two stages.
First, the Web page is segmented into a number of information blocks using the VIPS algorithm. Then the type of each information block is determined, and the type of the Web page is decided according to whether any block meets the requirements of a topic-content block. The experimental results show that our method judges the type of a Web page efficiently, with an accuracy of 98.6%.

Secondly, the paper summarizes previous methods of Web content extraction and, on that basis, proposes a new Web content-extraction algorithm. Based on the segmentation of a page into content blocks, the algorithm analyzes each block and identifies a number of characteristics of topic-content blocks. It then uses probability theory to quantify these characteristics and to obtain the probabilistic relation between each characteristic and the topic-content block. Finally, it calculates for each content block the probability that it is a topic-content block by combining the block's characteristics, and compares this probability with a threshold to decide the nature of the block. Experiments show that the new algorithm extracts the topic content of Web pages effectively and outperforms other similar algorithms.

Finally, this paper presents two specific applications of Web content extraction: Hidden Web classification and Web retrieval. In Hidden Web classification, the content-extraction algorithm of this paper is used to discover the Hidden Web's text description, which then serves as a classification feature and clearly improves the classification results. In Web retrieval, we extract the topic content of the Web pages in the experimental set with the new algorithm, then index and retrieve over it.
Compared with similar methods, the results show a substantial improvement in retrieval precision, recall, and other indicators.

Because the work applies and extends Web content segmentation, the paper analyzes two effective Web content segmentation algorithms in detail and compares them experimentally. The new algorithms presented in this paper depend on the effectiveness of Web content segmentation, so further improving the accuracy and soundness of segmentation, as well as finding more properties of topic-content blocks, are useful ways to improve their results.
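
The abstract describes scoring each content block with a probability built from several characteristics, comparing that score with a threshold, and judging the page type by whether any block qualifies as a topic-content block. The following Python sketch illustrates that idea only. The abstract does not name the characteristics, the probability values, the combination rule, or the threshold, so everything here is an assumption made for illustration: the two features (`long_text`, `low_link_density`), the numbers in `LIKELIHOODS` and `PRIOR_TOPIC`, the naive-Bayes-style combination under an independence assumption, and the cut-off of 0.5. Blocks are assumed to come from a segmentation step such as VIPS, which is not implemented here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One content block produced by a page-segmentation step (assumed given)."""
    text_chars: int  # characters of visible text in the block
    link_chars: int  # characters of that text that sit inside links


# Hypothetical values of P(feature present | block class); not from the thesis.
LIKELIHOODS = {
    "long_text": {"topic": 0.8, "other": 0.2},
    "low_link_density": {"topic": 0.9, "other": 0.3},
}
PRIOR_TOPIC = 0.3  # assumed prior share of topic-content blocks


def features(block):
    """Map a block to boolean characteristics (illustrative choices only)."""
    density = block.link_chars / block.text_chars if block.text_chars else 1.0
    return {
        "long_text": block.text_chars >= 200,
        "low_link_density": density < 0.3,
    }


def topic_probability(block):
    """Combine the characteristics naive-Bayes style into P(topic | features)."""
    p_topic, p_other = PRIOR_TOPIC, 1.0 - PRIOR_TOPIC
    for name, present in features(block).items():
        lk = LIKELIHOODS[name]
        p_topic *= lk["topic"] if present else 1.0 - lk["topic"]
        p_other *= lk["other"] if present else 1.0 - lk["other"]
    return p_topic / (p_topic + p_other)


def topic_blocks(blocks, threshold=0.5):
    """Keep the blocks whose probability reaches the (assumed) threshold."""
    return [b for b in blocks if topic_probability(b) >= threshold]


def page_type(blocks, threshold=0.5):
    """Call a page 'authority' if any block qualifies, otherwise 'hub'."""
    return "authority" if topic_blocks(blocks, threshold) else "hub"
```

With these made-up numbers, a long block with few links scores well above the threshold and a short block made entirely of links scores far below it, so a page containing the former is labelled an Authority page. In practice the probabilities would be estimated from labelled blocks rather than written by hand.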
Keywords/Search Tags: Content-extraction, Page-cleaning, Content segmentation, Page analysis