
Research on K-Medoids Algorithms with Variance-Optimized Initial Cluster Centers and External Clustering Evaluation Indices

Posted on: 2016-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: R Gao
Full Text: PDF
GTID: 2208330473461422
Subject: Computer application technology
Abstract/Summary:
As an important unsupervised learning method, cluster analysis is a core data mining technique, and with the rapid emergence of big data, clustering algorithms have attracted wide attention from researchers. Partitioning methods are the most commonly used clustering algorithms, but how to effectively determine the initial cluster seeds remains a key problem in partitioning-based clustering. Likewise, external evaluation indices are the most commonly used clustering evaluation measures, yet traditional external indices fail when the clusters are imbalanced, so evaluating imbalanced clusterings is an urgent open problem. To find an effective method for choosing initial seeds in the K-medoids clustering algorithm, and to overcome the failure of traditional external evaluation indices on imbalanced clusters, this thesis makes the following contributions:

(1) A K-medoids clustering algorithm optimized by the variance of the Num-near neighbors is proposed. The new algorithm uses local variance to select the initial seeds for K-medoids, so that the initial seeds are as close as possible to the optimal ones; exemplars lying in dense areas are chosen as the initial seeds. Experiments on UCI datasets and artificial simulation datasets show that the algorithm achieves good clustering quality, strong robustness to noise, and suitability for cluster analysis of large-scale datasets.

(2) Variance-based K-medoids clustering algorithms are proposed that take the mean distance between samples, together with the standard deviation over specific samples, as the neighborhood radius. The samples with minimum variance are selected as initial seeds, subject to the constraint that the distance between any two initial seeds is at least the neighborhood radius.
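Illustratively, the variance-based seed selection of contributions (1) and (2) might be sketched as follows. This is a sketch under assumptions, not the thesis's exact algorithm: the choice of `num_neighbors`, the use of the mean pairwise distance as the neighborhood radius, and the greedy densest-first ordering are all illustrative choices.

```python
import numpy as np

def variance_based_seeds(X, k, num_neighbors=10):
    """Pick k initial medoids for K-medoids clustering.

    Sketch of the idea: samples whose num_neighbors nearest neighbors
    have the smallest local variance lie in dense areas and make good
    seeds; a minimum-separation radius keeps seeds apart so that no
    two seeds fall into the same dense region.
    """
    n = len(X)
    # Pairwise Euclidean distance matrix (n x n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local variance: variance of the distances to each sample's
    # num_neighbors nearest neighbors (column 0 is the self-distance).
    nearest = np.sort(dist, axis=1)[:, 1 : num_neighbors + 1]
    local_var = nearest.var(axis=1)
    # Neighborhood radius: mean off-diagonal pairwise distance
    # (an assumed stand-in for the thesis's radius definition).
    radius = dist.sum() / (n * (n - 1))
    seeds = []
    for i in np.argsort(local_var):  # densest samples first
        if all(dist[i, j] >= radius for j in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    return X[seeds], seeds
```

On two well-separated blobs, the radius constraint forces the two seeds into different clusters, which is exactly the behavior the thesis attributes to its variance-based initialization.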
Experiments on UCI machine learning repository datasets and synthetically generated datasets demonstrate that the proposed K-medoids algorithms obtain a closer approximation of the true cluster structure in less time, and that they scale to large datasets.

(3) One external evaluation index based on the contingency table and two external evaluation indices based on sample pairs are proposed. To overcome the shortcoming that imbalanced clusterings cannot be measured without considering specificity, the contingency-table-based index takes both sensitivity and specificity into account; moreover, the new indices do not depend on a specific dataset distribution. By means of sample pairs, sensitivity, specificity, and precision are redefined, and the new sample-pair-based indices are built from these definitions. An empirical investigation of several clustering evaluation measures, applied to partitions generated from UCI datasets and artificial simulation datasets, shows that the new contingency-table-based index can evaluate imbalanced clusterings and that the new sample-pair-based indices evaluate clusterings more objectively. In particular, the new sample-pair-based external index that takes specificity into account is a relatively ideal measure.
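The pair-counting view underlying contribution (3) can be sketched as follows. Every pair of samples is classified by whether the two samples share a cluster in the ground truth and in the prediction; sensitivity, specificity, and precision are then ordinary rates over these pair counts. The combination into a single score (a geometric mean here) is an illustrative choice, not the thesis's exact index definition.

```python
from itertools import combinations

def pair_based_index(labels_true, labels_pred):
    """Pair-counting sensitivity, specificity and precision.

    tp: pair together in both truth and prediction
    fn: together in truth, separated in prediction
    fp: separated in truth, together in prediction
    tn: separated in both
    Specificity rewards correctly separated pairs, which is what
    traditional indices miss on imbalanced clusters.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Illustrative aggregate: geometric mean of the three rates.
    score = (sensitivity * specificity * precision) ** (1 / 3)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "score": score}
```

Note how the specificity term exposes degenerate solutions: lumping an entire imbalanced dataset into one cluster keeps sensitivity at 1.0 but drives specificity to 0.0, so the combined score collapses, whereas a specificity-blind index could still rate such a clustering highly.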
Keywords/Search Tags:K-medoids clustering, Initial seeds, Neighborhood, Local variance, Num-neighbor, Standard deviation, Specificity, Sample-pair, External evaluation indices