Dimension Reduction And Multiple Kernel K-Means Algorithm For Text Clustering

Posted on: 2014-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Deng
GTID: 2268330401485892
Subject: Computer application technology
Abstract/Summary:
With the rapid development of networks and the continuous progress of information technology, all kinds of data are expanding at a surprising rate, and the growth of text data is the most significant. Finding useful information in this massive amount of text data and classifying it has become an urgent need. The K-Means clustering algorithm is widely used in text clustering because of its simplicity and fast convergence. This thesis therefore builds on the K-Means clustering algorithm, as follows.

The traditional K-Means clustering algorithm is sensitive to its initial cluster centers, which are chosen at random. It therefore easily falls into a local optimum and gives unstable results when the initial centers are poorly chosen. Drawing on related research in China and abroad, this thesis proposes a K-Means algorithm with improved initial cluster centers. It avoids selecting noise points as initial centers, and it also helps merge high-density areas, expanding the existing area available to the centers within a certain scope. Based on the density method and the idea of the Maximum Minimum Distance (MMD), the algorithm first selects K pairs of high-density points that have the maximal distance between each other, and then uses the K vertical centers of these K pairs as the initial cluster centers for K-Means. Experiments on standard UCI data sets verified the effectiveness of the proposed algorithm. Furthermore, the algorithm was used to cluster Chinese text after preprocessing and dimension reduction. The experiments show that it gives more stable results and better accuracy.

Text data is high-dimensional and sparse. When it is clustered with the traditional K-Means algorithm, the Euclidean distance metric does not handle nonlinear data effectively, so the algorithm cannot form effective clusters and takes a long time. This thesis proposes a dimension reduction and multiple kernel K-Means algorithm for text clustering, which addresses both the high-dimension problem and nonlinear, messy data samples. The algorithm first uses Principal Component Analysis (PCA) to reduce the dimension of the text data, and then applies the multiple kernel K-Means algorithm to cluster the dimension-reduced text. The method obtains the optimal combination of a given set of kernel functions by solving a semi-definite programming problem, which improves the ability of kernel K-Means to handle nonlinear text data. The experimental results show that this algorithm is superior to the traditional K-Means algorithm and the traditional single kernel K-Means algorithm.
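
The initial-center selection described in the abstract could be sketched roughly as follows, in Python with NumPy. The abstract does not give exact definitions, so several choices here are assumptions rather than the thesis's method: density is counted as the number of neighbours within a hypothetical `radius`, points at or above the mean density count as high-density, the K seeds are picked by farthest-first (max-min distance) selection, each seed is paired with its nearest high-density neighbour, and the "vertical center" of a pair is read as its midpoint. The function name `density_mmd_centers` is illustrative only.

```python
import numpy as np

def density_mmd_centers(X, k, radius):
    """Pick k initial centers from dense regions (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Assumed density definition: number of other points within `radius`.
    density = (dist <= radius).sum(axis=1) - 1
    # Assumed threshold: points at or above mean density are "high density";
    # sparser points are treated as noise and never become seeds.
    dense = np.flatnonzero(density >= density.mean())
    if len(dense) < max(k, 2):
        raise ValueError("not enough high-density points for k centers")

    # Max-min distance (farthest-first) seeding, restricted to dense points.
    seeds = [dense[np.argmax(density[dense])]]
    while len(seeds) < k:
        min_d = dist[np.ix_(dense, seeds)].min(axis=1)
        seeds.append(dense[np.argmax(min_d)])

    centers = []
    for s in seeds:
        d = dist[s, dense].copy()
        d[dense == s] = np.inf            # exclude the seed itself
        partner = dense[np.argmin(d)]     # nearest other high-density point
        # Assumed reading of "vertical center": midpoint of the pair.
        centers.append((X[s] + X[partner]) / 2.0)
    return np.array(centers)
```

The returned array could then be passed as the starting centers to any K-Means implementation that accepts explicit initial centers. Because isolated points have low density under this definition, they are excluded before seeding, which matches the stated goal of keeping initial centers off noise points.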
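
The dimension reduction and multiple kernel clustering pipeline can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the kernel weights are simply passed in by hand, whereas the thesis obtains the optimal combination by solving a semi-definite program, which is not reproduced here. The choice of a linear and an RBF kernel, the `gamma` parameter, the seed-based initialization, and all function names are assumptions made for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices.
    The weights are hand-picked here; the thesis learns them with an SDP."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("kernel weights must be non-negative")
    w = w / w.sum()
    return sum(wi * Ki for wi, Ki in zip(w, kernels))

def kernel_kmeans(K, k, seed_idx, max_iter=100):
    """Kernel K-Means on a precomputed kernel matrix K.
    seed_idx: k sample indices used as initial centers (assumed scheme)."""
    n = K.shape[0]
    diag = np.diag(K)
    seed_idx = np.asarray(seed_idx)
    # Initial labels: nearest seed in feature space,
    # ||phi(x_i) - phi(x_s)||^2 = K_ii - 2 K_is + K_ss.
    d0 = diag[:, None] - 2.0 * K[:, seed_idx] + diag[seed_idx][None, :]
    labels = d0.argmin(axis=1)
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_{j,l} K_jl
            dist[:, c] = (diag
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

A typical call sequence under these assumptions would be `Z = pca_reduce(X, d)`, then `K = combine_kernels([linear_kernel(Z), rbf_kernel(Z, gamma)], weights)`, then `kernel_kmeans(K, k, seed_idx)`. Working on the kernel matrix rather than on coordinates is what lets the clustering step separate data that is not linearly separable in the reduced space.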
Keywords/Search Tags: K-Means Algorithm, Text Clustering, Clustering Center, Dimension Reduction, Multiple Kernel Clustering