Research On Chinese Text Clustering Method

Posted on: 2010-10-01  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhang  Full Text: PDF
GTID: 2178360272979337  Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology and database technology since the end of the last century, more and more information has become available, and most of it exists in the form of text. Finding the right information quickly in a large collection of texts is becoming increasingly urgent. Text mining processes these texts to provide people with more useful information. As an important branch of text mining, text clustering deserves further research.

The k-means algorithm is one of the classical algorithms in text clustering. Two improvements were made to adapt it to large-scale text clustering. First, the initial points of the k-means algorithm were studied in depth, and it was shown that their selection is important to the algorithm. After a review of existing methods, the CURE algorithm was applied within k-means to improve the selection of initial points, taking the characteristics of the text matrix into account. By removing isolated points and small clusters, both of which grow slowly, the improved algorithm reduced the impact of isolated points on the cluster centers. Second, an improved algorithm for characteristic word selection in text clustering was presented, in which the idea of dynamic, local PCA was applied to k-means. At the beginning of clustering, more of the text information was retained; during iteration, the improved algorithm chose the appropriate characteristic words and used them in dynamic clustering. The improved algorithm achieved higher accuracy and a higher convergence rate.

Finally, the algorithms were verified by experiments, the results were analyzed, and the advantages and weaknesses of the algorithms were discussed.
Keywords/Search Tags: text clustering, k-means algorithm, CURE algorithm, PCA
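
The abstract does not give the details of the improved initialization, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes documents are already represented as numeric vectors. As a simplified stand-in for CURE's representative-point hierarchy, it merges clusters by centroid distance, treats clusters that stayed small as isolated points, and seeds k-means with the centroids of the remaining clusters. The names `seed_centers`, `kmeans`, `stop_at`, and `min_size` are hypothetical and chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def seed_centers(points, k, stop_at, min_size):
    """Pick k initial centers while ignoring isolated points.

    Assumption: plain centroid-linkage merging replaces CURE's
    representative points; stop_at and min_size are illustrative knobs.
    """
    clusters = [[list(p)] for p in points]
    while len(clusters) > stop_at:
        # Merge the two clusters whose centroids are closest.
        cents = [centroid(c) for c in clusters]
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ab: sq_dist(cents[ab[0]], cents[ab[1]]))
        clusters[i].extend(clusters.pop(j))
    # Clusters that stayed small grew slowly: treat them as outliers.
    kept = sorted(
        (c for c in clusters if len(c) >= min_size), key=len, reverse=True
    )
    if len(kept) < k:
        raise ValueError("not enough sizeable clusters to seed k centers")
    return [centroid(c) for c in kept[:k]]


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the supplied centers."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        # Update step: recompute each center from its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data: two tight groups plus one isolated point at (50, 50).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
seeds = seed_centers(points, k=2, stop_at=3, min_size=2)
centers, labels = kmeans(points, seeds)
```

In this toy run the isolated point is excluded when the seeds are chosen, so both initial centers fall inside the two dense groups. It is still assigned to a cluster during the k-means iterations, which is one reason a real implementation would need the further handling the thesis describes.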