Font Size: a A A

Webpage Segmentation Algorithm Based On Planar Graphs

Posted on:2010-04-27Degree:MasterType:Thesis
Country:ChinaCandidate:Y TianFull Text:PDF
GTID:2178360302460826Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Nowadays, people rely on search engine to find useful information from the web, the vast and rapidly growing repository of information. However, since web pages are usually embedded with various kinds of valuable semantic information, they are not atomic units for information search on the Web. Web page segmentation, whose aim is to partition a web page into visually continuous and cohesive regions with unified theme in content and purpose, is an essential task of web page understanding.Traditional approaches to web page segmentation usually use heuristic or machine learning algorithms to analyze the DOM structure of the page, either by visual analysis or interpreting the meaning of tag structures in some way. While these approaches might work well on some kind of pages, they still have problems that limit them to be universal applicable. Very recently, Chakrabarti et al. dealt with web page segmentation from a graph-theoretic perspective, however, the underlying graph of this approach is too complex, making the task of solving the minimization problem very difficult. Thus this approach is not efficient and does not suit for real applications.We propose a novel web page segmentation algorithm based on finding the Gomory-Hu tree in a planar graph. The algorithm firstly distills vision and structure information from a web page to construct a weighted undirected graph, whose vertices are the leaf nodes of the DOM tree and the edges represent the visible position relationship between vertices. Then it partitions the graph with the Gomory-Hu tree based clustering algorithm. Since the graph constructed by both vision and structure information is a planar graph, the algorithm is very efficient. Meanwhile, the quality of graph partition can be guaranteed through the use of Gomory-Hu algorithm. Experimental results show that, compared with VIPS and Chakrabarti et al.'s graph theoretic algorithm, our algorithm improves upon the other two with much higher precision and recall, and its running time is far lower than that of Chakrabarti et al.'s graph theoretic algorithm.
Keywords/Search Tags:Webpage segmentation, HTML DOM tree, Gomory-Hu Algorithm, Planar graph
PDF Full Text Request
Related items