
Research And Application On Clustering For XML Documents

Posted on: 2016-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wang
Full Text: PDF
GTID: 2308330479976620
Subject: Computer Science and Technology
Abstract/Summary:
XML (eXtensible Markup Language) is an important standard for describing, transferring, and exchanging information on the Internet, and clustering XML documents is central to many technologies for integrating and managing them. The goal of clustering XML documents is to group them effectively so that they are convenient to store and manage. Calculating the similarity of XML documents is the pivotal step in clustering. Among the traditional similarity algorithms, the tree edit distance algorithm has too high a time complexity; the tag similarity algorithm loses the structural properties of XML documents; the pair and path similarity algorithms are limited and cannot be extended easily; and vector space algorithms are simple to compute but also lose part of the structural properties because of the way they set feature weights. In this thesis, we focus on setting weights for the features of XML documents, study algorithms for calculating similarity and clustering XML documents, and obtain the following results:

1. We study the pq-gram algorithm for calculating the similarity of XML documents. Each node of an XML document has its own level and its own position within that level, and these differences reflect the structural properties of the document; however, the pq-gram algorithm fails to take them into account. We therefore propose a new weighted pq-gram algorithm, which sets weights for nodes and pq-grams by jointly considering the level and position of each node in the XML document tree together with the position of its parent node. The algorithm uses these weights to improve the similarity calculation. Finally, we cluster three real XML document sets with clustering algorithms and compare the results in terms of clustering accuracy and the similarity of the clusters.

2. On this basis, we study common feature extraction methods for XML documents and find that most of them set weights only according to structural properties and fail to take into account that features are ordered. In fact, the level of a node cannot completely determine its importance to clustering. We therefore study the algorithm of clustering with feature order preference (CFP). Using the weighted pq-gram algorithm to extract features and combining it with the CFP algorithm, we propose an algorithm for clustering XML documents based on feature order preference (CXFP). The CXFP algorithm combines the weight based on structural properties with the weight based on feature order and updates the weights dynamically during clustering. Experiments show that, by incorporating feature order, this algorithm significantly improves clustering accuracy.

3. We analyze the current state of airport noise and explain why the problem needs to be studied. On this basis, we apply the CXFP algorithm to clustering airport noise data, setting feature weights according to different feature orders. Experiments show that the CXFP algorithm achieves better clustering accuracy than other algorithms.
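To make the weighted pq-gram idea in result 1 more concrete, the following is a minimal sketch in Python. It builds a pq-gram profile for an XML tree (a stem of p ancestor labels plus a window of q consecutive child labels, padded with null labels) and compares two profiles with a bag-overlap similarity. The abstract does not give the thesis's weighting formula, so the function `depth_weight` is a hypothetical stand-in that only uses node level; the names `pq_gram_profile`, `profile_similarity`, and `NULL` are likewise assumptions made for this illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NULL = "*"  # placeholder label for missing ancestors and children


def depth_weight(depth):
    """Hypothetical weight: nodes closer to the root count more.
    This is NOT the thesis's formula, which also uses node position."""
    return 1.0 / (1.0 + depth)


def pq_gram_profile(root, p=2, q=3, weight=None):
    """Return a bag (Counter) mapping each pq-gram label tuple to its
    accumulated weight. With weight=None every pq-gram counts as 1."""
    profile = Counter()

    def visit(node, ancestors, depth):
        stem = ancestors + (node.tag,)  # p labels ending at this node
        w = weight(depth) if weight else 1.0
        children = [child.tag for child in node]
        if not children:
            # A leaf contributes one pq-gram with q null children.
            profile[stem + (NULL,) * q] += w
        else:
            # Pad with q-1 nulls on each side and slide a window of size q.
            padded = [NULL] * (q - 1) + children + [NULL] * (q - 1)
            for i in range(len(padded) - q + 1):
                profile[stem + tuple(padded[i:i + q])] += w
        for child in node:
            visit(child, stem[1:], depth + 1)

    visit(root, (NULL,) * (p - 1), 0)
    return profile


def profile_similarity(a, b):
    """Bag overlap: 2 * |A ∩ B| / (|A| + |B|), in the range [0, 1]."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0
    shared = sum((a & b).values())  # element-wise minimum of weights
    return 2.0 * shared / total


doc_a = ET.fromstring("<book><title/><author/><year/></book>")
doc_b = ET.fromstring("<book><title/><author/><price/></book>")

plain = profile_similarity(pq_gram_profile(doc_a), pq_gram_profile(doc_b))
weighted = profile_similarity(
    pq_gram_profile(doc_a, weight=depth_weight),
    pq_gram_profile(doc_b, weight=depth_weight),
)
print(plain, weighted)
```

In this sketch the weight of a pq-gram is taken from its anchor node only. A scheme closer to the one described above would also pass the node's position among its siblings and its parent's position into the weight function.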
Keywords/Search Tags: Calculating Similarity, Clustering XML Documents, Weight, Feature Order, Airport Noise