Study On The Tag-based Analysis Technique Of Extracting The Body Of The Page

Posted on:2011-01-24

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Chang

GTID:2178360308958689

Subject:Computer software and theory

Abstract/Summary:

This thesis analyzes the characteristics and impact of noise data on eight websites (Sina, Sohu, Netease, Tencent, Baidu, China News Network, CDC, and 21CN) and, making use of the structural features of HTML, proposes two region-block-based methods for automatically extracting the body of a web page: HTML element deletion and embedded element extraction. Both methods are also feasible for XHTML and XML; for simplicity, the rest of the thesis omits the description of XHTML and XML. The content and results of the study are as follows.

① Based on an analysis of the text links and image links that are relevant or irrelevant to the page content, an estimation model for links irrelevant to the page body is put forward, which combines each HTML tag with its content.

② Based on statistics over a large number of news pages, the thesis comprehensively analyzes the characteristics of the image tag and defines a more accurate range for images that are relevant or irrelevant to the page text and content.

③ After analyzing existing body-extraction techniques, such as the traditional DOM tree and static page expressions, the thesis introduces the concept of the region block and locates the body by its similarity rate to the page title. On this basis, it proposes two methods for extracting web page body content based on HTML tags: region-block-based HTML element deletion and region-block-based embedded element extraction.

④ Taking the eight news websites as a test set, the two proposed body-extraction methods are evaluated. The comparison shows that both are superior to the traditional extraction method.

In summary: first, the proposed region-block-based embedded element extraction and element deletion methods accurately extract the subject content of HTML documents, leaving the structure and content of the extracted body unchanged and without depending on the structure of the source page. The approach is automatic, reliable, and versatile. Second, because the method is based on the HTML specification and the extracted content and structure are the same as in the source page, it is highly scalable. Finally, combined with a web crawling program, the method preprocesses HTML documents to extract their thematic content, which significantly improves retrieval efficiency and precision. It meets the demand of PDA and mobile phone users for instant access and can also be used in information retrieval, automatic summarization, and automatic classification systems, so it has strong application value.
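
The abstract does not spell out the thesis's algorithms, so the following Python sketch only illustrates the general idea of tag-based body extraction and is not the author's method. It assumes a simple link-density heuristic in place of the thesis's link estimation model: text is attributed to the innermost enclosing block element, blocks whose text is mostly link text are treated as noise, and the remaining block with the most non-link text is returned. The names `BlockCollector`, `extract_body`, `BLOCK_TAGS`, and `max_link_density` are hypothetical, and the sketch assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate "region blocks".
BLOCK_TAGS = {"div", "td", "article", "section"}
# Content of these tags is never part of the page body.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Attribute each text run to the innermost open block element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indexes into self.blocks for currently open blocks
        self.blocks = []   # one record per block: text pieces and link characters
        self.in_link = 0   # nesting depth inside <a> elements
        self.in_skip = 0   # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.in_skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"text": [], "link_chars": 0})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.in_skip:
            self.in_skip -= 1
        elif tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.in_skip or not self.stack:
            return
        block = self.blocks[self.stack[-1]]
        block["text"].append(text)
        if self.in_link:
            block["link_chars"] += len(text)


def extract_body(html, max_link_density=0.5):
    """Return the text of the block with the most non-link text.

    Blocks whose share of link text exceeds max_link_density are skipped
    as likely navigation or advertisement noise.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    best_text, best_len = "", 0
    for block in collector.blocks:
        total = sum(len(piece) for piece in block["text"])
        if total == 0:
            continue
        if block["link_chars"] / total > max_link_density:
            continue
        plain = total - block["link_chars"]
        if plain > best_len:
            best_text, best_len = " ".join(block["text"]), plain
    return best_text


sample = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content"><p>The city council approved the new transit plan.</p>
<p>Construction is expected to begin next year.</p></div>
<div id="footer"><a href="/about">About us</a></div>
<script>var tracking = "ignore this text";</script>
</body></html>
"""
print(extract_body(sample))
```

The 0.5 threshold is an arbitrary illustrative value. A fuller implementation along the lines the abstract describes would also weigh image tags and the similarity between each block and the page title.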

Keywords/Search Tags:

regional sub-block, page text, link determination model, image tags, HTML documents

Related items

1	Based On The Theme Of The Html Tags Crawler Design And Realization
2	Improving Web retrieval by mining the HTML tags for keywords and exploring the hyperlink structures of Web pages
3	The Research And Implementation On Web Page Segmentation
4	Printed Documents Source Identification Using Geometric Distortion On Text Lines
5	Research And Implementation On Key Technology Of Web Text Collection And Analysis
6	The use of social tags in text and image searching on the Web
7	The Research Of Text-based Image Retrieval Technology In Uyghur Kazak Kirgiz Search Engine
8	The Research And Implementation Of Web Text Classification Based On Web Text Segmentation
9	Semantic hierarchies of HTML documents and their applications
10	Research On The Intelligent Acquisition Of Web-Based News Contents