Font Size: a A A

Research Of XML Similarity Based Edit Graph

Posted on:2012-08-12Degree:MasterType:Thesis
Country:ChinaCandidate:Z LiFull Text:PDF
GTID:2178330332499667Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the rapid development of accordingly the data, HTML defects increasingly obvious, the traditional web technology can't satisfy the needs of internet, the semi-structured XML solve this problem. XML has strong expansibility and readability, It can effectively describe all kinds of data, and play more and more important role in data representation and data exchange. So, KDD and the database must support to XML. The similarity between XML documents is the foundation of document clustering,KDD and information retrieval, the research of XML similarity is the hotspot now.We pay attention to algorithms for comparing XML. Many algorithms for comparing XML are already published. They are classified in three classes:Edit Distance (ED) based methods. Information Retrieval (IR) based methods, and other methods based on different techniques to XML comparison. ED-based methods transform XML documents into XML document trees, find the edit distance between tree structures with dynamic programming techniques. Most of these methods are used to document/document comparison tasks. They target rigorously data-centric XML and are usually fine-grained. They are mainly used for applications that require accurate detection of XML document structural similarities, like XML classification/clustering applications and data integration. ED algorithms can produce edit scripts, an edit script produces a similarity value, along the script. And we can describe the change between trees. The edit graph algorithm is fast and effective, and is the basis of many ED algorithms. We are the same.We first introduce the concept and process of KDD, the related concepts and characteristics of XML. Secondly we expound the original edit graph algorithm, and the related knowledge are presented, include representation model and edit graph. The main idea of edit graph algorithm is that we transform a XML document tree into a sequence with the special standards, the result of comparing the sequence as comparing trees. The order of brother nodes is very important, but it contradicts the characteristics of data-centric XML documents, so the final result is not accurate enough. Then we analysis the algorithm, find out the deficiency and improve. At last, we publish a new algorithm based edit graph: splitting edit graph algorithm. We transform XML trees into the sub-path sets, and compare the path sets as compare trees.We illustrate algorithm calculations with an example, and analyze the characteristics of the algorithm and deficiency. Then we compare edit graph algorithm and splitting edit graph algorithm by doing some experiments. The order of nodes that locate at the same level does not affect the final result, the new algorithm is more accurate than edit graph algorithm. Like other ED algorithms, splitting edit graph algorithm can produce edit scripts, describe the process of converting. Make sure to distinguish between nodes these have the same sign. When we calculation edit distance, only one edit operation effective for each node. Finally, we make the conclusions and prospect the splitting edit graph algorithm. It will play very well in document clustering,KDD and information retrieval. And it has strong expansibility. It could have new characteristics when we use new methods to compare two sub-path sets.
Keywords/Search Tags:XML, XML similarity, Edit graph, Split
PDF Full Text Request
Related items