Research Of Improved K-means Algorithm And Cluster Validity Index Based On Three-way Decision

Posted on: 2022-10-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y Y Xia  Full Text: PDF
GTID: 2480306542462884  Subject: Software engineering
Abstract/Summary:
Cluster analysis mines the internal characteristics of unlabeled samples and autonomously divides a dataset into several subsets without knowing the data labels or the number of classes, which makes it an extremely important unsupervised learning method. It has been widely used in many fields closely related to human life, such as decision making, speech recognition, and pattern processing. In cluster analysis, both the choice of a good clustering algorithm and a reasonable optimal number of clusters (Kopt) have a vital influence on the final clustering result. However, existing clustering methods still have several problems with these two key factors: the optimal number of clusters Kopt cannot be determined, the methods are sensitive to outliers and noise points, processing all samples uniformly may lead to poor final clustering quality, and cluster validity indices (CVIs) cannot effectively evaluate clusters of arbitrary shape. To address the shortcomings of some existing clustering methods, the main work of this thesis is as follows:

1. By tightly combining three-way decision theory with the K-means algorithm, this thesis proposes a new clustering algorithm, TK-means. The algorithm divides the data space into a core area and an edge area and clusters them separately, which effectively solves the inaccurate clustering that the traditional K-means algorithm produces by processing all samples uniformly. The thesis also borrows the idea of grid division from grid-based algorithms and uses grid density to quickly determine the core and edge points, which avoids computing over all samples and improves the efficiency of the algorithm. In addition, a new method for determining the initial cluster centers is proposed, combining the density method with the roulette wheel method. To address the poor clustering of irregular datasets, the thesis proposes the principle of assigning core points to nearby centroids and edge points to nearby core points; this new division method improves the ability to handle irregular datasets.

2. By evaluating the core area and the edge area separately, this thesis proposes a new cluster validity index, TCVI. The index avoids the problem that some existing indices treat the whole sample space as a unified whole and thereby amplify the possible adverse effects of the edge area. TCVI measures clustering quality by analyzing the intra-cluster compactness and inter-cluster separation of the clustering result. For intra-cluster compactness, the index is again based on three-way decision theory and uses different calculation rules for the core area and the edge area. For inter-cluster separation, the index uses the inter-cluster separation of the core area to represent that of the entire sample space. By reducing the influence of the edge area in this way, it evaluates the clustering result more effectively and determines the optimal number of clusters Kopt.

3. To verify the effectiveness of the improved algorithm TK-means and the new cluster validity index TCVI, extensive experiments were conducted on simulated and real datasets. The results show that the proposed TK-means algorithm outperforms the comparison methods in terms of clustering quality and common cluster validity indices, and that the TCVI index outperforms the comparison indices in evaluation performance and stability.
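The core/edge idea in item 1 can be illustrated with a minimal Python sketch. It is not the thesis's TK-means: the function names, the `cell_size` and `density_threshold` parameters, and the seeded random initialization (used here in place of the density plus roulette-wheel initialization) are all assumptions made for illustration. The sketch marks points in dense grid cells as core points, runs plain K-means on the core points only, and then gives each edge point the label of its nearest core point.

```python
from collections import Counter

import numpy as np


def split_core_edge(X, cell_size, density_threshold):
    """Return a boolean mask: True for core points, False for edge points.

    Assumption: a point is "core" if its grid cell holds at least
    `density_threshold` points. The thesis's actual rule may differ.
    """
    # Integer grid-cell coordinates of every point.
    cells = np.floor((X - X.min(axis=0)) / cell_size).astype(int)
    keys = [tuple(c) for c in cells]
    counts = Counter(keys)  # number of points per occupied cell
    return np.array([counts[key] >= density_threshold for key in keys])


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means with seeded random initial centers (a stand-in)."""
    if len(X) < k:
        raise ValueError("need at least k points")
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers


def cluster_core_then_edge(X, k, cell_size, density_threshold):
    """Cluster core points by centroid, then attach edge points to core points."""
    is_core = split_core_edge(X, cell_size, density_threshold)
    core, edge = X[is_core], X[~is_core]
    core_labels, _ = kmeans(core, k)
    labels = np.empty(len(X), dtype=int)
    labels[is_core] = core_labels
    if len(edge):
        # Each edge point inherits the label of its nearest core point.
        dist = np.linalg.norm(edge[:, None, :] - core[None, :, :], axis=2)
        labels[~is_core] = core_labels[dist.argmin(axis=1)]
    return labels, is_core
```

Calling `cluster_core_then_edge(X, k, cell_size, density_threshold)` on an `(n, d)` NumPy array returns one label per point plus the core mask. Because edge points follow their nearest core point rather than the nearest centroid, a sparse point on the fringe of an elongated cluster stays with that cluster even when another centroid happens to be closer.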
Keywords/Search Tags: clustering algorithm, three-way decision, core point, edge point, cluster validity index