A Framework Of Web Page Analysis And Content Extraction Based On Coordinate Tree

Posted on:2007-09-05

Degree:Master

Type:Thesis

Country:China

Candidate:B Liu

GTID:2178360185468267

Subject:Circuits and Systems

Abstract/Summary:

With the growth of dynamic page generation techniques, the number and complexity of Web pages are increasing explosively, and so is the information they contain. To increase the commercial value and accessibility of their pages, most content sites publish pages with intra-site redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information inflates the index size of general search engines and causes the topics of pages to drift. Different kinds of information, including redundant and irrelevant information, are distributed and mixed within a page, which makes it difficult for machines to identify the useful information automatically. This not only increases the cost for search engines to index Web pages, but also makes it difficult for users with small display devices to browse them. In this dissertation, we propose a novel system that uses page layout analysis and content extraction to obtain the informative parts of pages and to improve the service quality of Web applications that depend on page content.

Because HTML documents are semi-structured and the DOM tree lacks position information and any description of the spatial relations between its leaf nodes, we propose a new framework for Web page analysis and content extraction. It includes a novel Coordinate tree model that carries position information and a graph model that reflects spatial relations. HTML documents are transformed into Coordinate trees, and pages are then analyzed and their content extracted based on position and spatial-relation features.

Experimental results on a set of 5000 web pages from 120 different sites show that our approach achieves an accuracy of 93.78%. And it also...
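The idea of attaching position information to DOM nodes and deriving spatial relations between leaf nodes can be illustrated with a minimal sketch. This is not the thesis's implementation: the class `CoordNode`, the functions `spatial_relation` and `relation_graph`, the relation labels, and the example pixel coordinates are all hypothetical, and the sketch assumes each node's rendered bounding box is already known.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CoordNode:
    """Hypothetical coordinate-tree node: a DOM element plus its rendered box."""
    tag: str
    x: float        # left edge, in pixels (assumed known from rendering)
    y: float        # top edge, in pixels
    width: float
    height: float
    children: List["CoordNode"] = field(default_factory=list)

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height

    def leaves(self) -> List["CoordNode"]:
        """Return the leaf nodes of this subtree in document order."""
        if not self.children:
            return [self]
        found: List["CoordNode"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def spatial_relation(a: CoordNode, b: CoordNode) -> str:
    """Coarse position of b relative to a (illustrative labels only)."""
    if b.x >= a.right:
        return "right_of"
    if b.right <= a.x:
        return "left_of"
    if b.y >= a.bottom:
        return "below"
    if b.bottom <= a.y:
        return "above"
    return "overlaps"


def relation_graph(root: CoordNode) -> List[Tuple[str, str, str]]:
    """Build edges (a, relation, b) for every pair of leaf nodes."""
    leaves = root.leaves()
    edges = []
    for i, a in enumerate(leaves):
        for b in leaves[i + 1:]:
            edges.append((a.tag, spatial_relation(a, b), b.tag))
    return edges


# Made-up layout: a navigation panel, an article body, and a footer.
page = CoordNode("body", 0, 0, 800, 650, [
    CoordNode("nav", 0, 0, 200, 600),
    CoordNode("article", 200, 0, 600, 600),
    CoordNode("footer", 0, 600, 800, 50),
])

for edge in relation_graph(page):
    print(edge)
```

Features such as a block's position, its size, and its relations to neighbouring blocks could then feed whatever rules separate informative content from navigation panels or copyright notices; the abstract does not specify those rules, so none are shown here.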

Keywords/Search Tags:

page layout analysis, content extraction, DOM, coordinate tree, heuristic rules

Related items

1	Study On Web Content Extraction And Semantic Recognition
2	Research On Content Extraction In HTML Web Pages Based Multi-Features
3	Analysis Of Deep Web Page's Structure And Its Rich-Content Extraction
4	Research On WEB Page Structure And Data Extraction Technology
5	Design And Implementation Of Education News Webpage Information Extraction System
6	The Designation And Implementation Of Business Insight System Base On Web Content
7	Study On Heuristic Ant Colony Algorithm For The Circles Layout Problem With Performance Constraints
8	Research On Web Page Area Weight For The Layout Of Personalized Recommendation Content
9	Extract Information Based On Semantic And Layout Of Online Characters
10	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree