Research And Application Of The Improved K-means Clustering Algorithm

Posted on: 2018-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: F F Wang
GTID: 2348330518466694
Subject: Computational Mathematics
Abstract/Summary:
Data mining is the process of extracting deep and valuable information from large amounts of data. It draws on a variety of techniques, including clustering, classification, association and prediction. Cluster analysis is an important branch of data mining: it divides a data set into disjoint subsets. Cluster analysis is now widely used in many fields, such as web search, artificial intelligence, information retrieval, image pattern recognition, spatial database technology and marketing management. The well-known and widely used clustering methods are mainly partition methods, hierarchical methods, density-based methods, grid-based methods and model-based methods.

The k-means algorithm is a commonly used clustering algorithm. Its principle is simple, it is easy to understand and implement, and it handles large data sets conveniently. Given a training data set and the number of clusters, the algorithm repeatedly re-clusters the data according to a criterion function until the function no longer changes or reaches an agreed threshold. The algorithm also has shortcomings: the number of clusters must be given in advance, the clustering results are sensitive to the chosen initial center points and to noise points in the data set, and the results may be only locally optimal.

This thesis addresses three of these problems, namely that the value of k must be given in advance and that the choice of initial centers and the presence of outliers strongly influence the clustering results, and proposes an improved k-means clustering algorithm based on the maximum-minimum distance. When applying the maximum-minimum distance method, the algorithm first splits the interval containing the parameter theta into smaller intervals, following the divide-and-conquer idea. Second, it clusters with a different theta drawn from each small interval and discards the intervals whose clustering results are poor. Finally, it discretizes the remaining intervals, following the idea of continuous-attribute discretization, sets theta to each endpoint of the discretized intervals in turn, and runs a cluster analysis for each value. Clustering quality is measured by the mean of 95% of the ordered BWP index values; the larger the mean, the better the clustering. In short, the improved algorithm removes the need to specify the number of clusters in advance and reduces the sensitivity of the results to the initial cluster centers and to abnormal points.

To verify the effectiveness of the improved algorithm, three data sets from the UCI database are clustered with different clustering algorithms. The results show that the improved algorithm achieves higher accuracy and a better clustering effect.

Finally, the thesis takes part of the data on Zhejiang telecom users as its object of study. On the one hand, it applies the traditional k-means algorithm, the k-means algorithm based on the maximum-minimum distance, and the improved k-means algorithm to the data. The results show that the improved algorithm clusters better and that the differences among categories are more pronounced. The thesis then summarizes the characteristics of each category, names the categories, and proposes a different marketing plan for each in order to improve the quality of service. On the other hand, it uses a historical data set of telecom users to train a Logistic classification model and forecasts the churn rate of the user segments identified above, so that customers at risk of leaving can be retained in advance.
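
The maximum-minimum distance step can be illustrated with a minimal sketch. It follows the commonly described textbook form of the rule, not necessarily the variant used in the thesis: the first point is taken as the first center (an assumption), the second center is the point farthest from it, and a further point becomes a center only if its distance to the nearest existing center exceeds theta times the distance between the first two centers. The function names max_min_centers and kmeans are hypothetical, Euclidean distance is assumed, and the interval search over theta and the BWP evaluation are not implemented here.

```python
import math


def max_min_centers(points, theta):
    """Choose initial centers with the max-min distance rule (textbook form).

    theta is a ratio in (0, 1); a smaller theta admits more centers.
    """
    # Assumption: start from the first point; other variants start elsewhere.
    centers = [points[0]]
    # Second center: the point farthest from the first one.
    second = max(points, key=lambda p: math.dist(p, centers[0]))
    base = math.dist(second, centers[0])
    if base == 0:
        return centers  # all points coincide, so one center is enough
    centers.append(second)
    while True:
        # Candidate: the point whose nearest center is farthest away.
        candidate = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        gap = min(math.dist(candidate, c) for c in centers)
        # Stop once no point is far enough from every existing center.
        if gap <= theta * base:
            return centers
        centers.append(candidate)


def kmeans(points, centers, max_iter=100):
    """Plain k-means started from the given centers; k is len(centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep a center that lost all members
        if new_centers == centers:
            break  # converged: centers no longer change
        centers = new_centers
    return labels, centers


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = max_min_centers(points, theta=0.5)
labels, final_centers = kmeans(points, initial)
print(len(initial), labels)
```

In this sketch the number of clusters is not passed in; it falls out of theta, which is why the choice of theta matters and why the abstract describes a search over intervals of theta values.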
Keywords/Search Tags: improved k-means algorithm, BWP index value, user segmentation, churn rate