A Framework Of Web Page Analysis And Content Extraction Based On Coordinate Tree

Posted on: 2007-09-05 | Degree: Master | Type: Thesis
Country: China | Candidate: B Liu | Full Text: PDF
GTID: 2178360185468267 | Subject: Circuits and Systems
Abstract/Summary:
With the growth of dynamic page generation techniques, the number and complexity of Web pages have increased explosively, and so has the information they contain. To increase the commercial value and accessibility of their pages, most content sites publish pages that carry intra-site redundant information, such as navigation panels, advertisements, and copyright notices. Such redundant information enlarges the indexes of general search engines and causes the topics of pages to drift. Different kinds of information, including redundant and irrelevant information, are distributed and mixed within a page, so it is difficult for machines to identify the useful information automatically. This not only increases the cost for search engines to index Web pages but also makes it difficult for users with small display devices to browse them. In this dissertation, we propose a novel system that uses page layout analysis and content extraction to obtain the informative parts of pages and to improve the service quality of Web applications based on page content.

Because HTML documents are semi-structured and the DOM tree lacks position information and any description of the spatial relations between its leaf nodes, we propose a new framework for Web page analysis and content extraction. It includes a novel Coordinate tree model that contains position information and a graph model that reflects the spatial relations. By transforming HTML documents into Coordinate trees, Web pages are analyzed and their content is extracted based on position features and spatial relations.

Experimental results on a set of 5000 Web pages from 120 different sites show that our approach achieves an accuracy of 93.78%. And it also...
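
The idea of a tree whose nodes carry position information, paired with a graph of spatial relations between leaf nodes, can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual data structures or rules, so everything here is an assumption made for illustration: the `CoordNode` class, its fields, the four relation labels, and the pixel coordinates of the example page are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class CoordNode:
    """Hypothetical tree node: a DOM-like tag plus its rendered bounding box."""
    tag: str
    x: int
    y: int
    width: int
    height: int
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the leaf nodes below this node, in document order."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def spatial_relation(a, b):
    """Coarse relation of box a to box b, or None if neither axis separates them."""
    if a.x + a.width <= b.x:
        return "left_of"
    if b.x + b.width <= a.x:
        return "right_of"
    if a.y + a.height <= b.y:
        return "above"
    if b.y + b.height <= a.y:
        return "below"
    return None


def relation_graph(root):
    """Build edges {(i, j): relation} over all pairs of leaf nodes."""
    leaves = root.leaves()
    edges = {}
    for i, j in combinations(range(len(leaves)), 2):
        relation = spatial_relation(leaves[i], leaves[j])
        if relation is not None:
            edges[(i, j)] = relation
    return leaves, edges


# Assumed example layout: a left navigation panel, a main block, and a footer.
page = CoordNode("body", 0, 0, 1000, 800, [
    CoordNode("nav", 0, 0, 200, 800),
    CoordNode("div", 200, 0, 800, 800, [
        CoordNode("article", 200, 0, 800, 600),
        CoordNode("footer", 200, 600, 800, 200),
    ]),
])

leaves, edges = relation_graph(page)
for (i, j), relation in edges.items():
    print(leaves[i].tag, relation, leaves[j].tag)
```

In a sketch like this, a tall, narrow leaf that lies to the left of every other leaf is a plausible navigation-panel candidate, which is the kind of position-based cue the abstract alludes to. The rules actually used in the thesis may differ.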
Keywords/Search Tags: page layout analysis, content extraction, DOM, Coordinate tree, heuristic rules