
Study On The Tag-based Analysis Technique Of Extracting The Body Of The Page

Posted on: 2011-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Chang
Full Text: PDF
GTID: 2178360308958689
Subject: Computer software and theory
Abstract/Summary:
This thesis analyzes the characteristics and impact of noise data on eight web sites (Sina, Sohu, Netease, Tencent, Baidu, China News Network, CDC, and 21CN) and, making use of the structural features of HTML, proposes two region-block-based methods for automatically extracting the body of a page: an HTML element deletion method and an embedded element extraction method. Both methods are also feasible for XHTML and XML; for simplicity, the rest of the thesis omits the description of XHTML and XML. The content and results of this study are as follows. ① Based on an analysis of the text links and image links that are relevant and irrelevant to the page content, an estimation model for links irrelevant to the body of the page is put forward, which combines each HTML tag with its contents. ② Based on statistics from a large number of news pages, the thesis comprehensively analyzes the characteristics of the image tag and defines more accurately which images are relevant and which are unrelated to the body text and page content. ③ After analyzing existing body extraction techniques, such as the traditional DOM tree and static page expressions, the thesis presents the concept of a region block, locates the body using the similarity rate between the page title and the body, and on this basis proposes two HTML-tag-based methods for extracting the body text of a web page: region-block-based HTML element deletion and region-block-based embedded element extraction. ④ Taking the eight news web sites as a test set, the two proposed body extraction methods were evaluated experimentally.
The results show that both proposed body extraction methods are superior to the traditional extraction method. In summary: first, the proposed region-block-based methods, embedded element extraction and element deletion, can accurately extract the main content of an HTML document while leaving the structure and content of the extracted body unchanged and without depending on the structure of the source page; the approach is automatic, reliable, and versatile. Second, because the methods are based on the HTML specification and the extracted content and structure are the same as in the source page, the approach is highly scalable. Finally, when combined with a web capture program, HTML documents can be preprocessed to extract their thematic content, which significantly improves retrieval efficiency and precision. The method therefore has strong application value: it meets the demand of PDA and mobile phone users for instant access, and it can also be used in fields such as automatic summarization for information retrieval and automatic classification systems.
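The abstract does not give the algorithms themselves, so the following is only a minimal Python sketch of the general idea it describes: split a page into region blocks, delete noise elements, discount link-heavy blocks, and pick the block most similar to the page title. The choice of block tags, the character-bigram similarity, the link-density penalty, and the names `BlockCollector`, `bigrams`, and `extract_body` are all assumptions made for illustration and are not taken from the thesis.

```python
from html.parser import HTMLParser

# All tag sets below are assumptions made for this sketch.
BLOCK_TAGS = {"div", "td", "article", "section"}  # tags that open a region block
NOISE_TAGS = {"script", "style"}                  # elements whose content is deleted
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no closing tag


class BlockCollector(HTMLParser):
    """Collects the page title and the text of each innermost region block."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []       # finished blocks: {"text": [str], "link_chars": int}
        self._stack = []       # blocks that are currently open
        self._noise_depth = 0  # > 0 while inside a noise element
        self._link_depth = 0   # > 0 while inside an <a> element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._stack.append({"text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag == "title":
            self._in_title = False
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS and self._stack:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
            return
        if self._noise_depth or not self._stack:
            return  # drop noise content and text outside any block
        block = self._stack[-1]  # text belongs to the innermost open block
        block["text"].append(text)
        if self._link_depth:
            block["link_chars"] += len(text)


def bigrams(text):
    """Character bigrams, ignoring case and whitespace (also usable for Chinese)."""
    chars = [c for c in text.lower() if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def extract_body(html):
    """Return the text of the block that best matches the title, or ""."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    title_grams = bigrams(parser.title)
    best_text, best_score = "", 0.0
    for block in parser.blocks:
        text = " ".join(block["text"])
        if not text:
            continue
        total_chars = sum(len(piece) for piece in block["text"])
        link_density = block["link_chars"] / total_chars
        overlap = len(title_grams & bigrams(text)) / max(1, len(title_grams))
        # Arbitrary scoring for this sketch: title overlap, discounted by
        # link density and weighted by text length.
        score = overlap * (1.0 - link_density) * len(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

Calling `extract_body(html)` on a news page returns the text of a single block, or an empty string when no block shares any bigram with the title. The length weighting is a deliberate simplification so that a short headline block does not outrank the article text; a fuller version would also need the link and image relevance models that the abstract mentions.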
Keywords/Search Tags: region block, page body text, link estimation model, image tags, HTML documents