Research On Approximate XML Joins

Posted on: 2013-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: F He
GTID: 2268330392467826
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the network, more and more information is appearing on the Internet. XML (eXtensible Markup Language) is the most popular tool for exchanging and storing data on the network. XML documents from different sources may represent the same or similar information, which causes a large amount of redundancy. Integrating the same or similar information is meaningful, because users can remove redundant information from the integrated XML documents and obtain more complete and useful information.

This paper introduces several XML similarity measures and presents a new XML similarity measure based on XML subtree matching. In the subtree similarity measure, this paper considers not only the PCDATA values of the subtree's leaf nodes but also the path similarity of the matching leaf nodes, so the definition of subtree similarity is based on both text similarity and path similarity. Based on subtree similarity, this paper proposes an XML similarity measure algorithm and an XML similarity join algorithm. The experimental results show that the subtree similarity calculation helps in joining XML documents.

Most XML clustering algorithms are based on tree edit distance and compare each pair of XML documents, so clustering time increases dramatically as the number of XML documents grows. This paper adds semantic information to the XML hierarchical structure and, according to that hierarchical structure, proposes a new XML document similarity measure. With some changes, the incremental clustering method CLOPE can be used to cluster XML documents without comparing each pair of documents. Experiments show that the incremental XML clustering method avoids comparing each pair of XML documents and greatly improves the efficiency of XML clustering.
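The abstract does not give the formula for subtree similarity, so the following is only a minimal Python sketch of the general idea: each leaf contributes a score that mixes text similarity of its PCDATA value with similarity of its root-to-leaf path, and a join keeps the pairs whose score reaches a threshold. The function names, the weight `alpha`, the best-match averaging, and the use of `difflib.SequenceMatcher` as the string and path comparator are all assumptions made for illustration, not the thesis's definitions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def leaves(root):
    """Collect (path, text) pairs for every leaf element with non-empty text."""
    found = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                found.append((path, text))
        for child in children:
            walk(child, path)

    walk(root, [])
    return found


def leaf_sim(leaf_a, leaf_b, alpha):
    """Weighted mix of text similarity and path similarity (assumed form)."""
    (path_a, text_a), (path_b, text_b) = leaf_a, leaf_b
    text_score = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    path_score = SequenceMatcher(None, path_a, path_b).ratio()
    return alpha * text_score + (1 - alpha) * path_score


def subtree_sim(root_a, root_b, alpha=0.5):
    """Symmetric average of each leaf's best match in the other subtree."""
    la, lb = leaves(root_a), leaves(root_b)
    if not la or not lb:
        return 0.0
    total = sum(max(leaf_sim(x, y, alpha) for y in lb) for x in la)
    total += sum(max(leaf_sim(y, x, alpha) for x in la) for y in lb)
    return total / (len(la) + len(lb))


def similarity_join(roots_a, roots_b, threshold, alpha=0.5):
    """Return index pairs (i, j) whose subtree similarity reaches the threshold."""
    return [
        (i, j)
        for i, a in enumerate(roots_a)
        for j, b in enumerate(roots_b)
        if subtree_sim(a, b, alpha) >= threshold
    ]
```

This nested-loop join compares every pair of subtrees and is meant only to make the measure concrete; it says nothing about how the thesis organizes or prunes the comparison.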
Keywords/Search Tags: Subtree matching, XML join, Similarity measure, Cluster analysis
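To illustrate how a CLOPE-style method can cluster XML documents without comparing each pair, here is a minimal Python sketch of the single allocation pass of standard CLOPE, applied to the set of element paths in each document. Treating a document as its set of root-to-node paths, the repulsion value `r = 2.0`, and the names `path_set`, `Cluster`, and `cluster_stream` are assumptions for illustration; the abstract does not describe the changes the thesis makes to CLOPE or the semantic information it adds.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_set(xml_text):
    """Represent a document by the set of its element paths (an assumed encoding)."""
    found = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        found.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return found


class Cluster:
    """CLOPE cluster summary: item occurrences, total size S, document count N."""

    def __init__(self):
        self.occ = Counter()
        self.size = 0
        self.n = 0

    def delta_add(self, items, r):
        """Change in S * N / W**r if a document with these items joined."""
        new_size = self.size + len(items)
        new_width = len(self.occ) + sum(1 for i in items if i not in self.occ)
        new_value = new_size * (self.n + 1) / new_width ** r
        old_value = self.size * self.n / len(self.occ) ** r if self.n else 0.0
        return new_value - old_value

    def add(self, items):
        self.occ.update(items)
        self.size += len(items)
        self.n += 1


def cluster_stream(xml_docs, r=2.0):
    """Assign each document, in arrival order, to the cluster with the largest gain."""
    clusters, labels = [], []
    for doc in xml_docs:
        items = path_set(doc)
        best_index, best_delta = None, Cluster().delta_add(items, r)
        for index, cluster in enumerate(clusters):
            delta = cluster.delta_add(items, r)
            if delta > best_delta:
                best_index, best_delta = index, delta
        if best_index is None:  # opening a new cluster gives the largest gain
            clusters.append(Cluster())
            best_index = len(clusters) - 1
        clusters[best_index].add(items)
        labels.append(best_index)
    return labels
```

Each document is compared only against the per-cluster summaries, so the cost grows with the number of clusters rather than with the number of document pairs. Standard CLOPE also has a refinement phase that moves documents between clusters until the profit stops improving; it is omitted here.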