Font Size: a A A

Study Of Chinese Text Clustering On Improved K-means Algorithm

Posted on:2011-09-04Degree:MasterType:Thesis
Country:ChinaCandidate:M LiFull Text:PDF
GTID:2178360305972972Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Along with the information time, various statistics of electronic texts increased sharply. How to manage the complex statistics validly and achieve needed information has become an imperative topic which needs to be tackled immediately. As efficient tools for manage and organize texts, text clustering and categorization have attracted more and more attention and research.In this paper,the main research target is Chinese texts, we do deep research on the whole process of text clustering ,which including two steps:texts pre-processing and clustering. On the analyze the shortcomings of K-means(short for KM) algorithm and bisecting k-means clustering(short for BKM) algorithm, we proposed a novel improved clustering algorithms on the basis of cooperative clustering: cooperate bisecting k-means clustering algorithm (short for CBKM).The main work of this paper is presented as follows:(1)We do analysis and deep research on the main clustering methods and their representative algorithms,and the range of convenience of these algorithms as well as their merits and drawbacks are listed at the same time.(2)A deep research on key technology such as text representation model, distance of texts, data structure text preprocessing and so on is done.(3)The performance of KM is effected by initial cluster centers, there is no uniform standard to obtain for the value of k; Affected by its outlines, KM only have locally optimal solution, not globally. The drawbacks of BKM is fractions of dataset which left behind with on other way to re-cluster if again at any level of the binary tree. Because of the defects of KM and BKM, a new CBKM algorithm is presented inspired by the cooperative clustering. The new algorithm is performed in there steps:global clustering step, cooperative clustering step and merging step. The algorithm is generated in the processing of forming CF tree in BKM with intermediately cooperate with KM. The algorithm introduces the concept of similar histograms which reflect the similarities of sub-clusters. According to the similar histogram, a merging factor mcf between sub-clusters is calculated. We merge two sub-clusters together with the highest mcf values, then update sub-clusters. This procedure can avoid producing cluster fractions effectively. Owing to the merging step, we successfully improved the situation that the clustering performance is affected by initial cluster centers. The CBKM can achieve global optimization.(4)Because of that CBKM is the hybrid of KM and CKM, its time complexity is higher than KM or BKM, but is much lower than sum of the later two.(5)Experiments were done using Sogou corpus, the result show that compare with KM and BKM,CBKM has higher values in mutual, purity, F-measure and lower value of entropy. The improved algorithm can effectively improve the performance.
Keywords/Search Tags:Cooperative Clustering, K-means Algorithm, Bisecting K-means Clustering Algorithm, Vector Space Model
PDF Full Text Request
Related items