Dimension Reduction And Multiple Kernel K-Means Algorithm For Text Clustering

Posted on:2014-06-08

Degree:Master

Type:Thesis

Country:China

Candidate:H Deng

Full Text:PDF

GTID:2268330401485892

Subject:Computer application technology

Abstract/Summary:

With the rapid development of networks and continuing progress in information technology, data of all kinds is growing at a surprising rate, and the growth of text data is the most significant. Finding useful information in these massive volumes of text and classifying it has become an urgent need. The K-Means clustering algorithm is widely used in text clustering because it is simple and converges quickly. This thesis therefore builds on the K-Means clustering algorithm in two ways.

First, the traditional K-Means algorithm is sensitive to its initial cluster centers, which are chosen at random. If the initial centers are chosen poorly, the algorithm easily falls into a local optimum and produces unstable results. Drawing on related research in China and abroad, this thesis proposes a K-Means algorithm with improved initial cluster centers. It avoids placing initial centers on noise points, and it merges high-density areas to expand, within a certain scope, the region in which the centers can lie. Combining a density-based method with the idea of Maximum Minimum Distance (MMD), the algorithm first selects K pairs of high-density points that are maximally distant from one another, and then uses the K midpoints of these pairs as the initial cluster centers for K-Means. Experiments on standard UCI data sets verify the effectiveness of the proposed algorithm. The algorithm is further applied to Chinese text after preprocessing and dimension reduction, where experiments show more stable results and better accuracy.

Second, text data is high-dimensional and sparse. When it is clustered with the traditional K-Means algorithm, the Euclidean distance metric does not handle nonlinear data effectively, so the algorithm fails to form effective clusters and takes a long time to run. This thesis proposes a text clustering algorithm that combines dimension reduction with multiple kernel K-Means, addressing both the high dimensionality and the nonlinear, disordered nature of the data samples. The algorithm first uses Principal Component Analysis to reduce the dimension of the text data, and then applies multiple kernel K-Means to cluster the reduced text. The method obtains the optimal combination of a given set of kernel functions by solving a semi-definite programming problem, which improves the ability of kernel K-Means to handle nonlinear text data. Experimental results show that the algorithm is superior to both the traditional K-Means algorithm and the traditional single-kernel K-Means algorithm.
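The center-initialization idea in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not define density, the high-density threshold, or how pairs are formed, so the neighbor-count density, the mean-density threshold, the nearest-dense-neighbor pairing, and the names `density_mmd_centers`, `kmeans`, and `radius` are all assumptions made for illustration.

```python
import numpy as np

def density_mmd_centers(X, k, radius):
    """Hypothetical seeding: midpoints of k far-apart pairs of dense points."""
    # Pairwise Euclidean distances (adequate for small data sets).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Assumed density: number of other points within `radius`.
    density = (dist <= radius).sum(axis=1) - 1
    # Assumed threshold: keep points at least as dense as the average,
    # so that isolated noise points can never become seeds.
    dense = np.flatnonzero(density >= density.mean())
    if len(dense) < max(k, 2):
        raise ValueError("too few high-density points; try a larger radius")
    d = dist[np.ix_(dense, dense)]
    # Max-min distance selection, starting from the densest point.
    chosen = [int(np.argmax(density[dense]))]
    while len(chosen) < k:
        min_to_chosen = d[:, chosen].min(axis=1)
        chosen.append(int(np.argmax(min_to_chosen)))
    # Assumed pairing: each seed with its nearest other dense point.
    d_no_self = d.copy()
    np.fill_diagonal(d_no_self, np.inf)
    partners = d_no_self[chosen].argmin(axis=1)
    # The midpoint of each pair is used as an initial center.
    return (X[dense[chosen]] + X[dense[partners]]) / 2.0

def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic demo data (not from the thesis): three blobs plus two outliers.
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack(
    [rng.normal(m, 0.5, size=(50, 2)) for m in blob_means]
    + [np.array([[5.0, 5.0], [12.0, 12.0]])]
)
init = density_mmd_centers(X, k=3, radius=1.0)
labels, centers = kmeans(X, init)
print(init)
print(centers)
```

Because the seeds are restricted to high-density points and spread out by the max-min rule, the two outliers in the demo cannot be chosen as initial centers. The `radius` value strongly affects which points count as dense and would need tuning for real text vectors.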

Keywords/Search Tags:

K-Means Algorithm, Text Clustering, Clustering Center, Dimension Reduction, Multiple Kernel Clustering

Related items

1	Research On Text Clustering Based On Text Dimension Reduction And Ant Colony Algorithm
2	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
3	Researching The Kernel Clustering Algorithm And Its Application In Text Clustering
4	Researches In Kernel-based Fuzzy C-Means Clustering Algorithm Based On GA Optimization
5	Clustering, Based On The Chinese Text Of The Som Algorithm
6	Research And Implementation Of Text Clustering Based On Fuzzy C-Means Clustering Algorithm
7	Precise Clustering Algorithm For Chinese Text Based On K-means
8	The Research And Application Of Multiple-exemplar Clustering
9	Research On Text Clustering Problems Of Kernel Function And Self-definite Category Number
10	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel