Research on XML Semantic Clustering Based on a Weighted Edge Set Comparison Algorithm

Posted on: 2011-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
GTID: 2178360305450706
Subject: Computer software and theory
Abstract/Summary:
XML (eXtensible Markup Language) is simple, extensible, highly interoperable, and open, and it is becoming a standard, technology-independent format for data exchange and transmission. Compared with HTML, XML offers greater flexibility: it can tag unstructured text and also mark up highly structured data (e.g., data in a database). With the rapid growth of XML data on the Web, helping users quickly and efficiently search large volumes of XML data and obtain useful information has become an urgent problem.

Document clustering is an effective means of helping people retrieve information, so XML document clustering has become an active research topic. The key to XML document clustering is measuring document similarity. Because XML documents are semi-structured text whose information is partly conveyed by the document structure, not every text similarity algorithm is suitable for XML document clustering.

Current methods for calculating XML document similarity are the element comparison method, the edge set comparison algorithm, and the tree edit distance method. The element comparison method is simple and fast, but it considers only the number of nodes and ignores the structural complexity of the XML document tree, so its clustering results are not very satisfactory. The tree edit distance method accounts for both the complex structure of the XML document tree and node similarity and achieves good clustering results, but it has higher time complexity. The edge set comparison method falls between the element comparison method and the tree edit distance method in performance.
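To make the contrast between element comparison and edge set comparison concrete, here is a minimal Python sketch. It is not the thesis's implementation: the function names, the sample documents, the reading of element comparison as overlap of tag names, and the normalization by the larger set are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def element_set(root):
    """Collect the distinct tag names that appear in the tree."""
    return {node.tag for node in root.iter()}


def edge_set(root):
    """Collect the distinct (parent tag, child tag) edges of the tree."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def overlap(a, b):
    """Shared items divided by the size of the larger set (assumed normalization)."""
    larger = max(len(a), len(b))
    return len(a & b) / larger if larger else 1.0


def element_similarity(xml_a, xml_b):
    return overlap(element_set(ET.fromstring(xml_a)), element_set(ET.fromstring(xml_b)))


def edge_similarity(xml_a, xml_b):
    return overlap(edge_set(ET.fromstring(xml_a)), edge_set(ET.fromstring(xml_b)))


# Hypothetical documents: same tags, but <name> hangs under a different parent.
doc_a = "<book><title/><author><name/></author></book>"
doc_c = "<book><author/><title><name/></title></book>"

print(element_similarity(doc_a, doc_c))  # tags are identical
print(edge_similarity(doc_a, doc_c))     # only two of three edges are shared
```

In this sketch the two documents contain the same tags, so element comparison cannot tell them apart, while edge set comparison registers that one parent-child relationship differs.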
This thesis extends the edge set comparison method and proposes a weighted edge set comparison algorithm. The algorithm eliminates nested and repeated nodes from the XML document tree to obtain a simplified XML labeled tree, and it incorporates semantic information when measuring the similarity between XML documents. Once the similarities among the XML trees have been computed, a classified clustering method is used to cluster the XML documents.

Based on the classic edge set comparison algorithm, this thesis makes the following contributions:
1. A weighted edge set comparison algorithm is proposed. Each edge of the XML summary tree is assigned a weight according to its structural complexity and level, which emphasizes the importance of the structure and levels of the XML tree.
2. The new algorithm uses semantic information when calculating the similarity of edges in XML labeled trees, then derives the set of semantically equivalent edges to determine the similarity between two XML labeled trees.

Experiments show that the semantic-based weighted edge set comparison algorithm yields better clustering results.
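The general idea of weighting edges by level and matching them semantically can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm from the thesis: the weight of 1/level, the small `SYNONYMS` table standing in for a real semantic resource, the way nested same-tag nodes are collapsed, and the normalization are all hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table; a real system would consult a semantic resource.
SYNONYMS = {"writer": "author", "heading": "title"}


def canon(tag):
    """Map a tag to its canonical label so synonymous tags compare equal."""
    return SYNONYMS.get(tag, tag)


def weighted_edges(root):
    """Return {(parent, child): weight} for the simplified tree.

    Repeated edges are kept once (with their largest weight), and a node
    nested directly inside a node of the same label is collapsed into it.
    The weight 1/level is an assumed scheme: edges near the root count more.
    """
    weights = {}
    stack = [(root, 1)]
    while stack:
        node, level = stack.pop()
        for child in node:
            p, c = canon(node.tag), canon(child.tag)
            if p == c:
                # Nested repetition: skip the edge and stay on the same level.
                stack.append((child, level))
                continue
            weights[(p, c)] = max(weights.get((p, c), 0.0), 1.0 / level)
            stack.append((child, level + 1))
    return weights


def weighted_similarity(xml_a, xml_b):
    """Weight of the shared edges divided by the larger total weight."""
    wa = weighted_edges(ET.fromstring(xml_a))
    wb = weighted_edges(ET.fromstring(xml_b))
    shared = sum(min(wa[e], wb[e]) for e in wa.keys() & wb.keys())
    total = max(sum(wa.values()), sum(wb.values()))
    return shared / total if total else 1.0


# Hypothetical documents for illustration.
doc_a = "<book><title/><author><name/><name/></author></book>"
doc_b = "<book><heading/><writer><name/></writer></book>"
doc_c = "<book><title/><author><email/></author></book>"

print(weighted_similarity(doc_a, doc_b))  # synonyms and repeats are absorbed
print(weighted_similarity(doc_a, doc_c))  # mismatch only on a deeper, lighter edge
```

Under these assumptions, `doc_a` and `doc_b` are treated as structurally identical despite different tag names and a repeated `name` node, and the mismatch between `doc_a` and `doc_c` costs less than it would with unweighted edges because it occurs one level further from the root.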
Keywords/Search Tags: Data Mining, XML Clustering, Edge Set Comparison Algorithm, Semantic Similarity