Precise Clustering Algorithm For Chinese Text Based On K-means

Posted on:2013-02-04

Degree:Master

Type:Thesis

Country:China

Candidate:X C Zhang

Full Text:PDF

GTID:2218330362462946

Subject:Computer software and theory

Abstract/Summary:

Text clustering plays an important role in data mining and machine learning. The technology has developed considerably and has produced a series of theoretical achievements. K-means is one of the classic text clustering algorithms, and its low time complexity has made it widely used in the field. This thesis studies the key technologies and algorithms of text clustering, proposes a new high-performance unsupervised feature selection method, and improves the K-means algorithm to address its defects. The major work is as follows.

First, we study feature selection algorithms for text clustering and identify some of their shortcomings; for example, they ignore potential associations between features. To overcome these defects and make feature selection more efficient, we introduce the idea of feature clustering and propose a new unsupervised feature selection algorithm that preserves clustering quality while effectively removing redundant feature words.

Secondly, we analyze the defects of Euclidean distance as a text similarity measure and revise how it is calculated. We consider not only the frequency of each word but also each word's different contribution to text clustering, and propose a weighted Euclidean distance based on information entropy.

Thirdly, we propose a precise K-means clustering algorithm with optimized initial center points. The initial cluster centers selected by the traditional K-means algorithm do not represent the whole text set well. To make the initial cluster centers more dispersed and more representative, we optimize their selection using the revised Euclidean distance.

Finally, we design detailed experiments to verify the work in this thesis. By comparing against existing algorithms, we analyze the experimental results, which demonstrate the validity and advancement of the new algorithms.
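The abstract does not give the formulas behind the entropy-weighted distance or the initial-center selection, so the following Python is only a rough sketch of how the second and third contributions might fit together. Everything in it is an assumption made for illustration: the weighting scheme (one minus the normalized entropy of a term's distribution across documents), the farthest-first seeding rule, and the function names `entropy_weights`, `weighted_euclidean`, and `farthest_first_centers` are not taken from the thesis.

```python
import math


def entropy_weights(docs):
    """Assumed scheme: weight = 1 - normalized entropy of each feature.

    docs is a list of equal-length term-frequency vectors. A term spread
    evenly over all documents gets a weight near 0; a term concentrated
    in few documents gets a weight near 1.
    """
    n_docs = len(docs)
    weights = []
    for j in range(len(docs[0])):
        column = [d[j] for d in docs]
        total = sum(column)
        if total == 0 or n_docs < 2:
            weights.append(0.0)
            continue
        h = -sum((c / total) * math.log(c / total) for c in column if c > 0)
        # Clamp to guard against tiny negative values from rounding.
        weights.append(max(0.0, 1.0 - h / math.log(n_docs)))
    return weights


def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def farthest_first_centers(docs, k, w):
    """Assumed seeding rule: start at document 0, then repeatedly add the
    document whose distance to its nearest chosen center is largest.
    Returns the indices of the chosen documents."""
    centers = [0]
    while len(centers) < min(k, len(docs)):
        best = max(
            (i for i in range(len(docs)) if i not in centers),
            key=lambda i: min(
                weighted_euclidean(docs[i], docs[c], w) for c in centers
            ),
        )
        centers.append(best)
    return centers
```

The chosen indices would then seed an ordinary K-means loop that uses `weighted_euclidean` in place of the plain Euclidean distance. The method described in the thesis may differ in both the weighting and the seeding rule.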

Keywords/Search Tags:

Text clustering, Feature selection, Initial clustering center, Redundant feature, Information entropy

Related items

1	Research On Clustering Algorithm Of K-medoids And Its Application In Text Clustering
2	Research On Text Clustering And Its Application In Topic Detection Analysis
3	Knn Text Classification Algorithm Based On The Semantics Of The Center
4	Research On Patent Text Clustering Based On Improved K-means Algorithm
5	Research On Problems Related To The Initial Center Selection In K-means Clustering Algorithm
6	Research Of Text Clustering Based On Genetic Algorithm
7	Research For Feature Selection Algorithm Based On Text Clustering
8	Research On Feature Selection Methods And Its Applications In Text Clustering
9	Based On CHI And Feature Clustering Text Feature Reduction
10	Research On Clustering Approach For Text Messages