
Web Page Structure Similarity Algorithms and Applications

Posted on: 2009-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: X He
GTID: 2208360272959191
Subject: Computer software and theory

Abstract/Summary:
This paper addresses the problem of evaluating the similarity of web pages, an important task in web information processing that is of great value in data extraction and web information retrieval. An effective measure of similarity between pages has proved important for improving the precision and performance of data extraction; it can also improve the performance of query engines, raise the quality of the data an engine returns, and reduce the storage consumed by redundant data. Algorithms that perform this task are usually time-consuming.

Traditionally, a tree model (such as the DOM tree) has been used to model the structural information of HTML documents. However, the DOM tree model captures only the nested structure of HTML tags. It does not take inner structural information, such as repetitive (similar) subtrees, into account, even though this kind of repetitive information is rather important when evaluating the similarity of web documents. To solve this problem, this paper proposes an alternative scheme called the "Tag-Bag tree" model. This model uses a "bag node" to represent the repetitive elements in the tree, and is thus able to capture this valuable structural information in semi-structured documents, which plays an important role in many web applications.

Based on this tree model, we propose an algorithm called CTM (Complex Tree Matching) that computes the maximum matching of two Tag-Bag trees and then calculates the similarity of the documents. CTM is fast because it is a restricted matching algorithm in which node replacement and level crossing are not allowed. Because the method reduces the number of edit operations allowed, it saves considerable time compared with existing methods. CTM also has the advantage of being able to identify a single repetitive pattern. Experiments on many real data sets show that our method outperforms the traditional algorithm, the Bag of XPath model, in both speed and precision.
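The abstract does not give the details of the Tag-Bag tree or of CTM, so the following is only a minimal sketch of the general idea, not the author's algorithm. It assumes that a "bag node" can be approximated by collapsing runs of adjacent, structurally identical siblings into one representative node, and it substitutes the well-known Simple Tree Matching recurrence as a stand-in for a restricted matching that forbids node replacement and level crossing. The names `Node`, `collapse_repeats`, `match`, and `similarity`, and the normalisation used for the score, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    is_bag: bool = False  # assumption: True when the node stands for a run of repeats


def signature(node: Node) -> str:
    """Structural fingerprint: the tag plus the fingerprints of all children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def collapse_repeats(node: Node) -> Node:
    """Return a copy in which adjacent identical sibling subtrees become one bag node."""
    collapsed: List[Node] = []
    run_sig = None
    for child in (collapse_repeats(c) for c in node.children):
        sig = signature(child)
        if collapsed and sig == run_sig:
            collapsed[-1].is_bag = True  # fold the repeat into the representative
        else:
            collapsed.append(child)
            run_sig = sig
    return Node(node.tag, collapsed, node.is_bag)


def size(node: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


def match(a: Node, b: Node) -> int:
    """Simple Tree Matching: size of the largest order-preserving, same-level matching.

    Nodes with different tags never match (no replacement), and children are
    only compared with children (no level crossing).
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1


def similarity(a: Node, b: Node) -> float:
    """Matched nodes as a fraction of all nodes, computed on the collapsed trees."""
    ta, tb = collapse_repeats(a), collapse_repeats(b)
    return 2 * match(ta, tb) / (size(ta) + size(tb))


if __name__ == "__main__":
    short_list = Node("ul", [Node("li", [Node("a")]) for _ in range(3)])
    long_list = Node("ul", [Node("li", [Node("a")]) for _ in range(5)])
    print(similarity(short_list, long_list))
```

In this sketch, a three-item list and a five-item list built from the same `li` template collapse to the same tree, so their similarity is 1.0, whereas matching the uncollapsed trees directly would give 2 × 7 / (7 + 11) ≈ 0.78. This is one way to see why repetition-aware modelling can matter when pages generated from the same template differ only in the number of records.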
Keywords/Search Tags: similarity of Web documents, tree matching, structural similarity