Research Of XML Document Clustering

Posted on:2011-10-28

Degree:Master

Type:Thesis

Country:China

Candidate:C Liu

Full Text:PDF

GTID:2178330332461278

Subject:Computer application technology

Abstract/Summary:

XML (eXtensible Markup Language), a common format for representing and exchanging data on the Internet, carries a wealth of information. Mining XML data has become an important part of text mining research, and large-scale text clustering is one of the effective methods for mining massive text collections. An efficient, fast XML clustering mechanism can provide better data for decision support, greatly shorten information retrieval time, improve query efficiency, and uncover potentially valuable information. XML document clustering has therefore become a new research focus in data mining.

This thesis focuses on XML document clustering techniques. Whether the task is document clustering, classification, or another data mining analysis, document similarity cannot be overlooked, because it is the basis of data mining. The thesis therefore concentrates on methods for calculating XML document similarity, first from structure alone and then from structure combined with content. First, it extends the path model of an XML document by adding the frequencies of paths and nodes, yielding the frequency-path model. Second, based on this model, it proposes a similarity calculation algorithm that applies position and frequency weights to the longest common subsequence (PFWLCS). Experimental results on real data sets show that the PFWLCS method achieves good recall and precision. Third, it presents a new frequency-path model that adds the content of XML document elements, and describes a similarity calculation algorithm that combines XML document structure and content (SCSC, Similarity Calculation with Structure and Contents).

Finally, it proposes the similarity-based neighbor center clustering (NCC) algorithm. Experiments show that the NCC algorithm achieves high purity and F-measure values and is suitable for clustering XML documents with different DTDs.
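The abstract does not give the definitions of the frequency-path model or the PFWLCS weights, so the following Python sketch only illustrates the general idea of comparing XML structure through path frequencies and a longest common subsequence. Everything in it is an assumption made for illustration: the function names, the choice to count every root-to-node tag path, the unweighted LCS ratio, and the symmetric frequency-weighted average are not taken from the thesis, and the position weights of PFWLCS are not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_frequencies(xml_text):
    """Count how often each root-to-node tag path occurs (illustrative model)."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, ())
    return counts


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(p, q):
    """Plain LCS ratio between two paths (no position weighting)."""
    return lcs_length(p, q) / max(len(p), len(q))


def document_similarity(doc_a, doc_b):
    """Frequency-weighted average of best path matches, made symmetric."""
    fa, fb = path_frequencies(doc_a), path_frequencies(doc_b)

    def directed(src, dst):
        total = sum(src.values())
        weighted = sum(
            freq * max(path_similarity(p, q) for q in dst)
            for p, freq in src.items()
        )
        return weighted / total

    return (directed(fa, fb) + directed(fb, fa)) / 2
```

Calling `document_similarity` on two XML strings returns a score between 0 and 1. A pairwise matrix of such scores is the kind of input a similarity-based clustering step would consume, although the NCC algorithm itself is not reproduced here.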

Keywords/Search Tags:

XML Document Clustering, Similarity Computation, Neighbor Center Clustering

Related items

1	Research On Semantic Similarity Computation And Applications
2	Research On Clustering Of Uncertain Data
3	Clustering Research Of XML Document
4	The Research And Application Of Spectral Clustering Algorithm Based On Neighbor Similarity Graph
5	Manifold Density Peak Clustering Algorithm And Its Application Of Weibo Text Classification
6	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
7	Grid-based Clustering Algorithm Analysis And Research
8	Document Clustering Method Based On WAF
9	The Application Research Of Incremental Clustering For Document Update Summarization
10	Research On Text Clustering Algorithm Based On Spectral Clustering