Research Of XML Similarity Based Edit Graph

Posted on:2012-08-12

Degree:Master

Type:Thesis

Country:China

Candidate:Z Li

Full Text:PDF

GTID:2178330332499667

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

With the rapid development of accordingly the data, HTML defects increasingly obvious, the traditional web technology can't satisfy the needs of internet, the semi-structured XML solve this problem. XML has strong expansibility and readability, It can effectively describe all kinds of data, and play more and more important role in data representation and data exchange. So, KDD and the database must support to XML. The similarity between XML documents is the foundation of document clustering,KDD and information retrieval, the research of XML similarity is the hotspot now.We pay attention to algorithms for comparing XML. Many algorithms for comparing XML are already published. They are classified in three classes:Edit Distance (ED) based methods. Information Retrieval (IR) based methods, and other methods based on different techniques to XML comparison. ED-based methods transform XML documents into XML document trees, find the edit distance between tree structures with dynamic programming techniques. Most of these methods are used to document/document comparison tasks. They target rigorously data-centric XML and are usually fine-grained. They are mainly used for applications that require accurate detection of XML document structural similarities, like XML classification/clustering applications and data integration. ED algorithms can produce edit scripts, an edit script produces a similarity value, along the script. And we can describe the change between trees. The edit graph algorithm is fast and effective, and is the basis of many ED algorithms. We are the same.We first introduce the concept and process of KDD, the related concepts and characteristics of XML. Secondly we expound the original edit graph algorithm, and the related knowledge are presented, include representation model and edit graph. The main idea of edit graph algorithm is that we transform a XML document tree into a sequence with the special standards, the result of comparing the sequence as comparing trees. The order of brother nodes is very important, but it contradicts the characteristics of data-centric XML documents, so the final result is not accurate enough. Then we analysis the algorithm, find out the deficiency and improve. At last, we publish a new algorithm based edit graph: splitting edit graph algorithm. We transform XML trees into the sub-path sets, and compare the path sets as compare trees.We illustrate algorithm calculations with an example, and analyze the characteristics of the algorithm and deficiency. Then we compare edit graph algorithm and splitting edit graph algorithm by doing some experiments. The order of nodes that locate at the same level does not affect the final result, the new algorithm is more accurate than edit graph algorithm. Like other ED algorithms, splitting edit graph algorithm can produce edit scripts, describe the process of converting. Make sure to distinguish between nodes these have the same sign. When we calculation edit distance, only one edit operation effective for each node. Finally, we make the conclusions and prospect the splitting edit graph algorithm. It will play very well in document clustering,KDD and information retrieval. And it has strong expansibility. It could have new characteristics when we use new methods to compare two sub-path sets.

Keywords/Search Tags:

XML, XML similarity, Edit graph, Split

PDF Full Text Request

Related items

1	Research On Graph Similarity Queries With Edit Distance Constrains
2	Research On Graph Search Problem Based On Edit Distance
3	Research On Graph Similarity Search On Uncertain Graph Databases
4	Research On Graph Similarity Matching Algorithm
5	Research On The Solving Technology Of Graph Edit Distance
6	Research On Graph Matching Algorithm Based On Edit Distance And Graph Embedding
7	Research On Top-k String Similarity Search Based On Edit Distance
8	Research On Graph-based Workflow Recommendation Using CNN
9	Research On String Edit Similarity Join
10	Research On Depth Image Denoising Algorithm Of Joint Graph Model