A Study On Methods Of Web Page Topical Information Extraction

Posted on: 2011-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ren
Full Text: PDF
GTID: 2178360305995326
Subject: Computer application technology
Abstract/Summary:
In the 21st century, the rapid development of the Internet has brought explosive growth in the information spread through it, and the demands of online users are growing as well. As the main representation of Web information, HTML Web pages are becoming more complex in structure and richer in content. A Web page consists of a variety of elements, such as page navigation, advertisements, copyright notices, topical text and other information; these are distributed over different areas of the page, which are defined as blocks in this paper. We believe that a complete Web page can be divided into several blocks. While browsing Web pages, people concentrate on the theme of the page, which is carried by what this paper calls topical blocks, rather than on the advertisements and hyperlinks. To extract the topical information from Web pages and provide more useful content to Web users, an efficient Web information extraction system must be able to identify and extract the topical text accurately. High-quality topical text is also important to the processing ability of information retrieval systems.

Based on long-term analysis of the content and structure of a large number of Web pages and on research into Web information extraction technology, we combine page segmentation and entropy methods to identify the topical blocks of Web pages and extract the theme information. The extraction process is carried out as follows:

(1) Parsing HTML pages. Because Web pages are semi-structured documents, HTML parsing is carried out as the first step to preprocess the pages downloaded from the Internet. As a result, HTML pages are represented by more structured DOM trees. Through the DOM interface, we can process the Web pages more automatically.

(2) Filtering redundant nodes. According to HTML tag features, we filter the redundant nodes in the DOM tree, such as picture nodes, script