Research On Effective Internal Index Framework For Cluster Evaluation

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
GTID: 2428330620465630
Subject: Computer Science and Technology
Abstract/Summary:
Over the past few decades, researchers have proposed a large number of clustering validity indexes for hard clustering. However, existing validity methods are sensitive to data characteristics: noise, density, geometric shape, and similar factors can degrade the performance of an internal index. To address these problems, this thesis analyzes the main factors that affect the performance of clustering algorithms, further studies clustering validation, and proposes three new internal indexes. The main contributions are as follows:

(1) To overcome the shortcomings of existing measures of intra-cluster compactness for single-linkage agglomerative hierarchical clustering, this thesis uses the longest edge of the minimum spanning tree as the intra-cluster compactness and puts forward a synthetical clustering validity (SCV) index for the single-linkage algorithm. Depending on the statistical method used, this index comes in two variants, am-SCV and gm-SCV.

(2) The SCV index performs well in evaluating the single-linkage algorithm, but it is not applicable to other hierarchical clustering algorithms. To this end, this thesis proposes a generalized synthetical clustering validity (GSCV) index. This index adopts a self-adaptive similarity measurement strategy to evaluate clustering results, which avoids the performance degradation caused by a mismatch between the similarity measure used by the clustering algorithm and the one used by the internal index. Depending on the statistical method used, the GSCV index comes in two variants, am-GSCV and gm-GSCV. This thesis verifies the performance of the new indexes on 15 artificial datasets (with different dimensions, spatial distributions, overlap, and sizes) and 4 real datasets, and compares them with seven other commonly used internal indexes. The experimental results show that the SCV and GSCV indexes accurately identify the optimal number of clusters for datasets with different densities, skewed distributions, and geometric structures.

(3) The SCV and GSCV indexes can be unified into a hierarchical clustering validity framework (HCVF) for evaluating hierarchical clustering algorithms. However, HCVF relies on the hierarchical structure generated by hierarchical clustering algorithms, so the framework can only evaluate clustering results produced by such algorithms. To solve this problem, this thesis extends the subclass concept so that the new index can be applied to non-hierarchical clustering algorithms. In addition, this thesis introduces graph theory to improve the HCVF framework, which both captures the spatial structure of the dataset and reduces the time complexity of computing the new clustering validity index. The improved graph-based clustering validity (GBCV) index inherits the advantages of the HCVF framework. Moreover, it is suitable for non-hierarchical clustering algorithms and greatly reduces the time complexity of computing an internal index. This thesis verifies the performance of the new indexes on 12 artificial datasets (with different dimensions, spatial distributions, overlap, and sizes) and 6 real datasets, and compares them with seven other commonly used internal indexes. The experimental results show that the SCV and GSCV indexes accurately identify the optimal number of clusters for datasets with different densities, skewed distributions, and geometric structures.
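
The longest-MST-edge measure described in item (1) can be illustrated with a minimal Python sketch. It is not the thesis's SCV formula, which the abstract does not give. It only computes, for each cluster, the longest edge of the Euclidean minimum spanning tree, and then aggregates those values in two ways. The function names are hypothetical, and reading "am" and "gm" as arithmetic and geometric mean is an assumption.

```python
import math


def longest_mst_edge(points):
    """Longest edge of the Euclidean minimum spanning tree of `points`.

    Prim's algorithm on the complete graph, O(n^2) time.
    Returns 0.0 for a cluster with fewer than two points.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each point to the tree
    best[0] = 0.0          # start the tree at the first point
    longest = 0.0
    for _ in range(n):
        # Add the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        longest = max(longest, best[u])
        # Relax the connection cost of every remaining point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return longest


def cluster_compactness(points, labels):
    """Map each cluster label to the longest MST edge inside that cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return {label: longest_mst_edge(pts) for label, pts in clusters.items()}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Assumes non-negative values; a zero (singleton cluster) gives 0.0.
    return math.prod(values) ** (1.0 / len(values))


# Toy data (hypothetical): two clusters on a plane.
points = [(0, 0), (1, 0), (3, 0), (10, 0), (10, 4)]
labels = [0, 0, 0, 1, 1]

compactness = cluster_compactness(points, labels)   # {0: 2.0, 1: 4.0}
am = arithmetic_mean(list(compactness.values()))    # 3.0
gm = geometric_mean(list(compactness.values()))
```

A smaller longest edge means the cluster can be spanned without any large jump, which matches how single linkage merges points along nearest-neighbour chains. A complete validity index would also need a separation term, which this sketch leaves out.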
Keywords/Search Tags: clustering, internal index, cluster validity index, optimal number of clusters