Web Page Structure Similarity Algorithms And Applications

Posted on:2009-10-09

Degree:Master

Type:Thesis

Country:China

Candidate:X He

Full Text:PDF

GTID:2208360272959191

Subject:Computer software and theory

Abstract/Summary:

This paper addresses the problem of evaluating the similarity of web pages, an important task in web information processing that is of great value in data extraction and web information retrieval. An effective measure of similarity between pages has proved important for improving the precision and performance of data extraction. It can also improve the performance of query engines, raise the quality of the data they return, and reduce the storage consumed by redundant data. Algorithms that perform this task tend to be time-consuming.

Traditionally, tree models (such as the DOM tree) have been used to model the structural information of HTML documents. However, the DOM tree model reflects only the nesting of HTML tags. It does not take into account inner structural information such as repetitive (similar) subtrees, even though this kind of repetition matters a great deal when evaluating the similarity of web documents. To solve this problem, this paper proposes an alternative scheme called the "Tag-Bag tree" model. The model uses "bag nodes" to represent the repetitive elements in the tree, and is thus able to capture this valuable structural information in semi-structured documents, which plays an important role in many web applications.

Based on this tree model, we propose an algorithm called CTM (Complex Tree Matching) that computes the maximum matching of two Tag-Bag trees and then calculates the similarity of the documents. CTM is fast because it is a restricted matching algorithm in which node replacement and level crossing are not allowed. Because the method reduces the number of edit operations allowed, it saves considerable time compared with existing methods. CTM also has the advantage of being able to identify a single repetitive pattern. Experiments on many real data sets show that our method outperforms the traditional algorithm, the Bag of XPath model, in both speed and precision.
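The abstract does not spell out how bag nodes are built or how CTM works internally, so the following is only a rough illustration of the general idea, not the thesis's algorithm. It assumes that a bag node can be approximated by collapsing runs of structurally identical adjacent siblings, and it substitutes the well-known Simple Tree Matching procedure, which likewise forbids node replacement and level crossing, for CTM. The names `Node`, `collapse`, `match`, and `similarity`, and the normalization used for the score, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    is_bag: bool = False  # assumption: marks a node standing for repeated siblings


def signature(node: Node) -> str:
    """Structural fingerprint: the tag plus the fingerprints of the children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def collapse(node: Node) -> Node:
    """Copy the tree, merging runs of structurally identical siblings into one bag node."""
    kids = [collapse(c) for c in node.children]
    merged = []
    i = 0
    while i < len(kids):
        j = i
        while j + 1 < len(kids) and signature(kids[j + 1]) == signature(kids[i]):
            j += 1
        if j > i:  # a run of at least two identical subtrees
            merged.append(Node(kids[i].tag, kids[i].children, is_bag=True))
        else:
            merged.append(kids[i])
        i = j + 1
    return Node(node.tag, merged, node.is_bag)


def size(node: Node) -> int:
    return 1 + sum(size(c) for c in node.children)


def match(a: Node, b: Node) -> int:
    """Simple Tree Matching: number of node pairs in the largest top-down,
    order-preserving mapping. Roots with different tags never match
    (no node replacement), and children only match children (no level crossing)."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1]),
            )
    return 1 + dp[m][n]


def similarity(a: Node, b: Node) -> float:
    """Matched pairs normalized by the average size of the collapsed trees (an assumed score)."""
    ta, tb = collapse(a), collapse(b)
    return 2 * match(ta, tb) / (size(ta) + size(tb))


def page_with_list(items: int) -> Node:
    """Toy page: html > body > ul > `items` identical li elements."""
    return Node("html", [Node("body", [Node("ul", [Node("li") for _ in range(items)])])])
```

Under these assumptions, `similarity(page_with_list(3), page_with_list(5))` evaluates to 1.0, because both lists collapse to the same single repeated pattern, whereas matching the uncollapsed trees would penalize the differing item counts.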

Related items

1	Research On Semantic Similarity Computation And Applications
2	Research On Ontology Matching Based On Word Embedding And Structural Similarity
3	Objective Image/Video Quality Assessment Based On Structural Similarity
4	Based On Structural Similarity And Sparse Representation Research On FR_IQA Algorithm
5	Design And Implement Of Duplicate Document Detection Based On Similarity Estimation
6	Document similarity based on concept tree distance
7	The Measurement Of The Structural Similarities Of XML Document Graphs
8	C Code Similarity Measurement Algorithm Based On Levenshtein Distance
9	Study On Image Inpainting Algorithm Based On Structural Similarity
10	The Algorithms Of Image Super Resolution Recovery Based On Structural Similarity