
Precise Clustering Algorithm For Chinese Text Based On K-means

Posted on: 2013-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: X C Zhang
GTID: 2218330362462946
Subject: Computer software and theory
Abstract/Summary:
Text clustering plays an important role in data mining and machine learning. The technology has developed considerably and has produced a series of theoretical achievements. K-means is one of the classic text clustering algorithms and, owing to its low time complexity, has been widely used in this field. This thesis studies the key techniques and algorithms of text clustering, proposes a new high-performance unsupervised feature selection method, and improves the K-means algorithm to address its defects. The main work is as follows.

First, we study feature selection algorithms for text clustering and identify some of their shortcomings; for example, they ignore potential associations between features. To overcome this defect and improve the efficiency of feature selection, we introduce the idea of feature clustering and propose a new unsupervised feature selection algorithm that preserves clustering quality while effectively removing redundant feature words.

Second, we analyze the defects of Euclidean distance as a text similarity measure and revise its calculation. We consider not only the frequency of each word but also each word's contribution to the clustering, and propose a weighted Euclidean distance based on information entropy.

Third, we propose a precise K-means clustering algorithm with optimized initial centers. The initial cluster centers selected by the traditional K-means algorithm do not represent the whole text set well. To make the initial centers more dispersed and more representative, we optimize their selection using the revised Euclidean distance.

Finally, we design detailed experiments to verify the work in this thesis. By comparing against existing algorithms and analyzing the experimental results, we demonstrate the effectiveness and advantages of the new algorithms.
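
The abstract does not give the formulas behind the entropy-weighted distance or the optimized choice of initial centers, so the following is only a minimal sketch under stated assumptions, not the thesis's actual method. It assumes documents are term-frequency vectors, weights each feature by one minus its normalized entropy across documents (so a word spread evenly over all documents contributes little), and substitutes a simple maximin (farthest-first) heuristic for the center selection. The names `entropy_weights`, `weighted_euclidean`, and `maximin_centers` are hypothetical.

```python
import math


def entropy_weights(docs):
    """Assumed weighting: 1 - normalized entropy of each feature across documents."""
    n_docs = len(docs)
    weights = []
    for j in range(len(docs[0])):
        column = [doc[j] for doc in docs]
        total = sum(column)
        if total == 0 or n_docs < 2:
            # A feature that never occurs (or a single document) carries no signal.
            weights.append(0.0)
            continue
        entropy = -sum((v / total) * math.log(v / total) for v in column if v > 0)
        # Clamp to guard against tiny negative values from rounding.
        weights.append(max(0.0, 1.0 - entropy / math.log(n_docs)))
    return weights


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def maximin_centers(docs, k, weights, first=0):
    """Farthest-first heuristic (an assumption): spread k initial centers apart.

    Starts from document index `first`, then repeatedly adds the document
    whose distance to its nearest already-chosen center is largest.
    Returns the indices of the chosen documents.
    """
    if k > len(docs):
        raise ValueError("k cannot exceed the number of documents")
    centers = [first]
    while len(centers) < k:
        candidates = [i for i in range(len(docs)) if i not in centers]
        best = max(
            candidates,
            key=lambda i: min(
                weighted_euclidean(docs[i], docs[c], weights) for c in centers
            ),
        )
        centers.append(best)
    return centers


# Toy term-frequency vectors: the third word occurs equally in every document.
docs = [[2, 0, 1], [3, 0, 1], [0, 3, 1], [0, 5, 1]]
weights = entropy_weights(docs)
centers = maximin_centers(docs, 3, weights)
```

In this toy data the third word appears once in every document, so its weight is close to zero and it barely affects the distance. The selected indices could then seed an ordinary K-means loop that uses `weighted_euclidean` in place of the plain Euclidean distance.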
Keywords/Search Tags:Text clustering, Feature selection, Initial clustering center, Redundant feature, Information entropy