Research on VSM-Based Chinese Text Clustering Algorithms

Posted on: 2009-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Yao
Full Text: PDF
GTID: 2178360242477085
Subject: Communication and Information System
Abstract/Summary:
Text clustering, one of the most important research branches of clustering, is the application of clustering algorithms to text processing. This thesis examines Chinese text clustering algorithms based on the VSM (Vector Space Model) in some depth. Using open-source corpora, it discusses the strengths and weaknesses of VSM-based algorithms and presents optimizations of text clustering algorithms, including dimension determination and feature selection.
First, the thesis reviews prior work in Chinese text clustering and surveys the basic research on feature selection and dimension determination. It also discusses Chinese text clustering algorithms and introduces the basics of clustering validity.
Building on this, the thesis implements several clustering algorithms and evaluates them on the open-source corpus of Sogou Laboratory. Based on the clustering results on this corpus, it discusses the strengths and weaknesses of these algorithms. The results indicate that the hierarchical method produces better clusters than the partitioning method, but takes longer to run.
Finally, to improve clustering quality, the thesis presents several optimizations of the clustering algorithms, including dimension determination and feature selection. These optimizations take into account not only the feature words themselves, but also the information they carry and the relationships between words. The experiments show that these optimizations effectively improve text clustering on the corpus. Measured by PP and PR, two common indexes of clustering validity, clustering accuracy improves by up to 11.4% and 20.5%, respectively.
Keywords/Search Tags: VSM, Text Clustering, Corpus
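As a rough illustration of the pipeline the abstract describes, the sketch below represents pre-segmented Chinese documents as vectors in a term space and groups them with a hierarchical (agglomerative) method. The abstract does not state the thesis's weighting scheme, similarity measure, or linkage rule, so TF-IDF weights, cosine similarity, and single linkage are assumptions made here for the example, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector.

    Assumption: plain TF-IDF weighting; the thesis may use another scheme.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each term once per doc
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative(vectors, k):
    """Merge the two most similar clusters until k remain.

    Assumption: single linkage (similarity of the closest pair of members).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(
                    cosine(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Toy input: documents already segmented into words (hypothetical data).
docs = [
    ["股票", "市场", "上涨"],
    ["股票", "市场", "下跌"],
    ["足球", "比赛", "进球"],
    ["足球", "比赛", "失球"],
]
vecs = tfidf_vectors(docs)
clusters = agglomerative(vecs, k=2)
print(clusters)
```

The pairwise search makes each merge step quadratic in the number of clusters, which is consistent with the abstract's remark that the hierarchical method costs more time than a partitioning method. Feature selection would correspond to pruning terms from the vectors before clustering.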