Font Size: a A A

The Research And Application Of Text Clustering Based On Improved K-means Algorithm

Posted on:2019-04-24Degree:MasterType:Thesis
Country:ChinaCandidate:M LiFull Text:PDF
GTID:2348330542981790Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development and popularization of the Internet,the text data are experiencing an explosive growth,which arouses a topic worth studying---how to obtain valuable information from these large-scale texts timely and accurately.In this context,text clustering technology is used to organize a large number of text information and extract important features.Therefore,the human force of document collations is reduced,and the efficiency of document retrieval is improved.So we can say that this topic has a far-reaching application prospect and is of great researching significance.The k-means algorithm is widely used in text clustering because of its simplicity and fastness.However,k-means algorithm has some shortcomings.The algorithm requires the user randomly input k initial clustering centers.The stochastic selection of the initial clustering center will affect the stability of the clustering and make the algorithm converge to the local optimum.The choice of k value is often given according to the users'professionalism.The selection of k value directly affects the result of clustering.In this paper,two shortcomings of k-means algorithm are studied and optimized,and the optimized algorithm is applied to text clustering.The main contents of this paper are as follows:First,clustering by fast search and find of density peaks?CFSFDP?is a density-based clustering algorithm,which is novel,simple and efficient,and can cluster various types of point sets.However,when the data set have a cluster which has a plurality of peak density,the clustering result is not ideal.In order to solve this problem,this paper proposes an optimized algorithm based on the boundary samples?M-CFSFDP?.The algorithm determines whether the clusters are merged according to the number of boundary samples of neighboring clusters.Using the representative data set selected as the test data set,the M-CFSFDP algorithm can be correctly clustered on the test data set.Comparing the clustering results of the original algorithm with the M-CFSFDP algorithm,the improved algorithm is more accurate than the CFSFDP algorithm when a cluster which has a plurality of peak density.Second,the algorithm will be affected by the cut-off distance when calculating the local density of the data points.According to the concept of potential energy in physics.On this basis,assuming that the data set is a potential energy field,all data points will have an impact on any other data points,so we can calculate the potential energy of each data point.In the data domain,the points with larger potential energy are located in the dense area,which is consistent with the density distribution of the original data.So the potential energy of the data points and the density of the data points have the same effect.Therefore,the paper proposes an optimization algorithm?P-CFSFDP?which replaces the data points with the potential energy.Experiments on classical simulated data set show that the improved algorithm not only can well represent the local density of each data point,but also can well find the initial cluster center from decision graph.Third,the k-means algorithm randomly selects the initial clustering center for iterative operation,which easily leads to the instability of the clustering results and converge to the clustering local optimum.The K-means algorithm?KP-CFSFDP?based on density peak optimization initial center is proposed.KP-CFSFDP algorithm uses P-CFSFDP algorithm to select the initial clustering center,and then use k-means algorithm for iterative clustering.Experiments on UCI datasets show that the KP-CFSFDP algorithm solves the problem that caused by K-means algorithm.Fourth,for the shortcoming of the input k value is randomly selected for the K-means algorithm.In this paper,an optimal clustering number determination method based on KP-CFSFDP algorithm is proposed?IKP-CFSFDP?.The algorithm can automatically determine the optimal clustering range.The algorithm first determine the maximum number of cluster(kmax),and use kmax for search upper bound.Then,we use the KP-CFSFDP algorithm to iterative clustering,and use DB?Davies-Bouldin?and SiL?Silhouette?as the clustering validity evaluation index.We use the index value to determine the best number of clusters.Theoretical analysis and experiments on UCI datasets show that the IKP-CFSFDP algorithm can not only determine the optimal clustering number,but also choose a good initial clustering center,which is very effective.Fifth,the paper applies the IKP-CFSFDP algorithm to the example of text clustering.Using Sogou text corpus as experimental data.Use NLPIR for Chinese word segmentation,removing stop words,and extracting feature words.The paper use TF-IDF to calculate the weight of the feature word,and establish the text representation VSM model,and combine the IKP-CFSFDP algorithm for text clustering.Experimental results show that the improved algorithm has higher precision and better stability.
Keywords/Search Tags:Text Clustering, K-means Algorithm, Clustering by Fast Search and Find of Density Peaks, Boundary sample, Initial Clustering, Optimal Clustering Number
PDF Full Text Request
Related items