A Chinese Text Clustering Algorithm Based On K-means

Posted on:2010-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:R Zhang

Full Text:PDF

GTID:2208360272994285

Subject:Computer software and theory

Abstract/Summary:

As a widely used algorithm in machine learning and data mining, k-means is also applied to document clustering because of its low time complexity. This thesis focuses on how to improve the performance of document clustering algorithms. Building on existing research, it proposes improved k-means algorithms and a new feature selection method, and it designs and implements a Chinese document clustering system based on the proposed algorithms. The main contributions are as follows:

1) Unsupervised feature selection for clustering is difficult because class label information is unavailable. Based on document frequency and term contribution, a greedy algorithm is introduced to select features incrementally. Experiments show that the proposed method can remove more features than traditional methods without degrading clustering quality.

2) To improve the clustering quality of k-means, well-separated initial centroids should be selected. Initial centroids are usually hard to select because document data is high-dimensional and sparse. A new method for selecting initial centroids is proposed. Experiments show that the centroids selected by the proposed method are well separated and highly representative.

3) To improve the cluster quality of bisecting k-means, the notion of neighbors used in shared nearest neighbor clustering is introduced. Experiments show that the improved algorithm performs better than the original one.

A document clustering system is designed and implemented using the algorithms above, and each algorithm in the system is compared and evaluated through experiments.
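The abstract names document frequency and term contribution as the basis of the feature selection step but does not give the greedy procedure itself. The sketch below only illustrates the two scores, under stated assumptions: documents are already segmented into token lists, weights are L2-normalised tf-idf, and term contribution is taken in its commonly cited form, the sum over ordered document pairs i ≠ j of w(t, d_i) · w(t, d_j). The function names, the `min_df` threshold and the final top-m ranking are hypothetical and stand in for the incremental greedy selection described in the thesis, which is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return (vectors, df) for a list of token lists.

    vectors: one dict per document mapping term -> L2-normalised tf-idf weight.
    df:      Counter mapping term -> number of documents containing it.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors, df


def term_contribution(vectors):
    """TC(t) = sum over ordered pairs i != j of w(t, d_i) * w(t, d_j).

    Uses the identity sum_{i != j} w_i * w_j = (sum w_i)^2 - sum w_i^2,
    so every document is visited only once.
    """
    total = defaultdict(float)
    squares = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
            squares[t] += w * w
    return {t: total[t] * total[t] - squares[t] for t in total}


def select_features(docs, min_df=2, top_m=1000):
    """Hypothetical selection rule: drop rare terms, then rank by TC."""
    vectors, df = tfidf_vectors(docs)
    tc = term_contribution(vectors)
    candidates = [t for t in tc if df[t] >= min_df]
    return sorted(candidates, key=lambda t: tc[t], reverse=True)[:top_m]


if __name__ == "__main__":
    # Toy corpus of pre-segmented Chinese documents (two finance, two sports).
    docs = [
        ["股票", "市场", "上涨"],
        ["股票", "市场", "下跌"],
        ["足球", "比赛", "进球"],
        ["足球", "比赛", "结果"],
    ]
    print(select_features(docs, min_df=2, top_m=4))
```

A term that occurs in a single document has a term contribution of zero under this definition, because it contributes to no pairwise document similarity. This is why rare terms can be removed without changing the similarities that k-means relies on.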

Keywords/Search Tags:

Document Clustering, k-means, bisecting k-means, Shared Nearest Neighbor

Related items

1	Research On Video Super-resolution Based On Bisecting K-means Clustering And Improved Nearest Feature Line
2	A Fast And Efficient Parallel Bisecting K-Means Algorithm
3	Study Of Chinese Text Clustering On Improved K-means Algorithm
4	The Application Research Of Incremental Clustering For Document Update Summarization
5	Research On Approximate Nearest Neighbor Search Based On Query-directed Graph
6	Stacked Hashing Quantization Algorithm For Nearest Neighbor Search
7	Research And Application Of Bisecting K-means Algorithm Analysis Based On Financial Customer Signature
8	Research On Improved K-means Clustering Algorithm Based On Density
9	K-NN, K-means And The Application In Text Mining
10	Improved K-Nearest Neighbor Classification