
Research Of Improved K-means Algorithm And New Cluster Validity Index In Cluster Analysis

Posted on: 2020-01-29 | Degree: Master | Type: Thesis
Country: China | Candidate: P Wen
GTID: 2428330575454472 | Subject: Software engineering
Abstract/Summary:
Cluster analysis is an important tool that autonomously discovers the natural structure of a data set and divides it into several clusters according to the regularities in the data. As an unsupervised learning method, cluster analysis has been widely used in data mining, pattern recognition, image processing and other fields. Research on cluster analysis falls mainly into two areas: clustering algorithms and cluster validity indexes (CVIs). In the era of big data, however, existing clustering algorithms and cluster validity indexes suffer from several problems, including low efficiency, poor accuracy of clustering results, sensitivity to noise points, and an inability to process large-scale data sets efficiently and correctly. To address these problems, this thesis improves the K-means algorithm and proposes a new cluster validity index for big data (the BCVI index). The main work of this thesis is as follows:

(1) To address the low efficiency of the traditional K-means algorithm on large-scale data sets, this thesis introduces the grid-partitioning idea of grid-based clustering into K-means and proposes an improved algorithm called Grid-K-means. The grid density from the grid algorithm is used to determine the initial cluster centers, which the K-means algorithm cannot determine by itself. To avoid the many parameters that a grid algorithm normally needs in order to divide the grid, the thesis operates on dynamic grids instead of on individual data points, which improves the efficiency and accuracy of Grid-K-means and reduces the number of initial parameters that must be set manually. The improved Grid-K-means algorithm has better stability, accuracy and robustness.

(2) This thesis proposes a new cluster validity index, BCVI, for large-scale data sets. The BCVI index uses weighted grids as multiple representative points to handle clusters of various shapes, avoiding the excessive computational complexity caused by having all sample points participate in the calculation. Multiple representative points also evaluate the quality of clustering results better than a single representative point. Finally, the separation between clusters is determined by combining the minimum spanning tree and the maximum spanning tree constructed over the cluster centers. Adding the maximum spanning tree among the cluster centers gives a better evaluation of the degree of separation between clusters, balances the differences in within-cluster compactness, and makes the evaluation produced by the BCVI index more stable.

(3) The BCVI index is a linear combination of intra-cluster compactness and inter-cluster separation. Analysis of its characteristics shows that the monotonic behavior of the BCVI index can be used to determine the optimal number of clusters (Kopt) quickly. The time BCVI needs to find Kopt is much lower than that of the usual approach of searching the empirical range 2 ≤ K ≤ √n, where n is the number of samples. The advantage is especially pronounced for large data sets.

(4) The improved Grid-K-means algorithm and the new cluster validity index BCVI are tested on simulated and real data sets. Experiments show that the Grid-K-means algorithm is faster and more accurate than the traditional K-means algorithm, the K-medoids algorithm, the K-means++ algorithm and the improved K-means algorithm. A comparison of BCVI with seven existing indexes (DI-index, DBI-index, I-index, CH-index, COP-index, STR-index and VCVI-index) shows that the new BCVI index is superior to the traditional indexes in data processing speed and stability.
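The abstract does not give the details of Grid-K-means, so the following is only a minimal sketch of the general idea in item (1): use grid density to choose initial cluster centers, then run ordinary K-means from those centers. It assumes a fixed uniform grid (not the dynamic grid used in the thesis), and the function names `grid_density_seeds` and `kmeans`, the `cells_per_axis` parameter, and the rule of skipping neighbouring cells are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def grid_density_seeds(points, k, cells_per_axis=10):
    """Return up to k initial centers: centroids of the densest grid cells.

    Assumption: a fixed uniform grid over the bounding box of the data.
    This is an illustrative stand-in, not the thesis's dynamic grid.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # Fall back to width 1.0 when all values on an axis are identical.
    widths = [((highs[d] - lows[d]) / cells_per_axis) or 1.0 for d in range(dims)]

    # Bucket every point into its grid cell.
    cells = defaultdict(list)
    for p in points:
        key = tuple(
            min(int((p[d] - lows[d]) / widths[d]), cells_per_axis - 1)
            for d in range(dims)
        )
        cells[key].append(p)

    # Visit cells from densest to sparsest.
    ranked = sorted(cells.items(), key=lambda kv: len(kv[1]), reverse=True)
    chosen_keys, seeds = [], []
    for key, members in ranked:
        # Skip cells touching an already chosen cell, so that one dense
        # region does not contribute several seeds (hypothetical rule).
        if any(all(abs(a - b) <= 1 for a, b in zip(key, c)) for c in chosen_keys):
            continue
        centroid = tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
        chosen_keys.append(key)
        seeds.append(centroid)
        if len(seeds) == k:
            break
    # May contain fewer than k seeds if too few non-adjacent cells exist.
    return seeds


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group;
        # an empty group keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups
```

A call such as `kmeans(points, grid_density_seeds(points, k=3))` runs the two steps together. Because seeding only counts points per cell, its cost grows with the number of points and occupied cells rather than with pairwise distances, which is the kind of saving the abstract attributes to working on grids instead of individual data points.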
Keywords/Search Tags: Cluster analysis, Clustering validity index, Optimal clustering number, K-means algorithm, Grid computing