Research Of XML Document Clustering

Posted on: 2011-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
GTID: 2178330332461278
Subject: Computer application technology
Abstract/Summary:
XML (eXtensible Markup Language), a common data representation and exchange format on the Internet, carries rich information. Data mining on XML has become an important part of text mining research, and large-scale text clustering is one effective method for mining data and information from massive text collections. An efficient, fast XML clustering mechanism can provide better data for decision support, greatly shorten information retrieval time, improve the efficiency of data queries, and uncover potentially valuable information. XML document clustering has therefore become a new research focus in data mining.

This thesis focuses on the technology of XML document clustering. Whether the task is document clustering, classification, or other data mining analysis, document similarity cannot be overlooked, because it is the basis of data mining. The thesis therefore concentrates on methods for calculating XML document similarity, first from structure alone and then from structure combined with content. First, the path model of an XML document is extended by adding the frequency of each path and node, giving the frequency-path model. Second, based on this model, a similarity calculation algorithm with position and frequency weights based on the longest common subsequence (PFWLCS) is proposed. Experimental results on a real data set show that PFWLCS performs well in terms of recall and precision. Third, a new frequency-path model is presented that adds the contents of XML document elements, and a similarity calculation algorithm combining XML document structure and contents (SCSC, Similarity Calculation with Structure and Contents) is described.

Finally, the neighbor center clustering algorithm (NCC), which is based on similarity, is proposed. Experiments show that the NCC algorithm achieves high purity and F-measure values and is suitable for clustering XML documents with different DTDs.
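
The abstract does not give the PFWLCS formula, so the following is only a minimal sketch of the general idea it names: represent each document as root-to-node tag paths with occurrence counts, and compare paths by their longest common subsequence. The function names, the normalization by the longer path, and the frequency-weighted averaging are all assumptions made for illustration. The position weights of the thesis are not modeled here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_frequencies(xml_text):
    """Map each root-to-node tag path (a tuple of tags) to its occurrence count."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, ())
    return counts


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def path_similarity(p, q):
    """LCS length normalized by the longer path (an assumed normalization)."""
    return lcs_length(p, q) / max(len(p), len(q))


def document_similarity(doc_a, doc_b):
    """Frequency-weighted best-match path similarity, averaged over both directions.

    This weighting is an illustrative assumption, not the thesis's PFWLCS.
    """
    fa, fb = path_frequencies(doc_a), path_frequencies(doc_b)

    def directed(src, dst):
        total = sum(src.values())
        best = (n * max(path_similarity(p, q) for q in dst) for p, n in src.items())
        return sum(best) / total

    return (directed(fa, fb) + directed(fb, fa)) / 2
```

Under these assumptions, two documents with identical structure score 1.0, and the score drops as their tag paths diverge. A pairwise similarity of this kind is the sort of input a similarity-based clustering step such as NCC would consume.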
Keywords/Search Tags: XML Document Clustering, Similarity Computation, Neighbor Center Clustering