
Research on Initialization Methods for Partitional Clustering Algorithms

Posted on: 2015-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y E Wang
Full Text: PDF
GTID: 2208330434951524
Subject: Computer application technology
Abstract/Summary:
With the development of computer hardware and software, storage devices can hold more and more information, and it is becoming increasingly difficult to find useful information in this vast amount of data. Data mining is a technology that helps people search for useful information in huge datasets: from a mass of data, it can discover hidden, potentially useful patterns and rules. Cluster analysis is an important part of data mining and has been studied by scholars at home and abroad. It is an unsupervised learning technique based on the idea that "birds of a feather flock together", and it finds groups in data. Such groups are called clusters: objects in the same cluster are similar to each other, and objects in different clusters are as dissimilar as possible. Clustering analysis can be divided into six kinds of clustering algorithms, including partitional, density-based, grid-based, hierarchical, and model-based clustering algorithms.

This paper mainly studies partitional clustering algorithms. After surveying the literature on partitional clustering at home and abroad, it proposes improved algorithms that address the existing problems. The main work is as follows:

1. K-means algorithm based on minimum variance optimization. The traditional K-means algorithm selects its initial cluster centers at random. The clustering result is affected by the input order of the data, and even when the input order is the same, the result may differ between runs. Random selection of the cluster centers can therefore make the clustering results unstable. Some existing improved K-means algorithms can select better initial cluster centers that conform to the original data distribution, but they require parameter values when selecting those centers.
In general, there are no rules for choosing these parameter values, and selecting them requires a certain amount of experience. The proposed algorithm selects the optimal initial cluster centers based on the minimum variance of the data objects and the mean distance between all data objects, and it needs no input parameters for this selection. The improved algorithm solves the problem of choosing the initial cluster centers well: it objectively selects initial cluster centers that are consistent with the distribution of the original dataset.

2. K-medoids algorithm based on minimum variance optimization. The K-medoids algorithm overcomes the sensitivity of the traditional K-means algorithm to noise, but it still suffers from random selection of the initial cluster centers and from poor scalability, because every object is evaluated as a candidate when the cluster centers are updated. As a result, the traditional K-medoids algorithm takes a long time to produce a result on huge datasets. This paper proposes a K-medoids algorithm based on minimum variance optimization. Using the minimum variance of the data objects and the mean distance between all data objects, the algorithm chooses the optimal initial cluster centers, so that the chosen centers agree as closely as possible with the original cluster centers in the distribution of the dataset. When updating a cluster center, the algorithm searches for the best candidate object in a local part of the dataset to replace the previous center. This speeds up the algorithm, improves its execution efficiency, and strengthens its scalability so that it can handle large datasets.

3. Clustering validity evaluation criteria.
A clustering algorithm is a technique for analyzing a dataset, and its results are often evaluated to decide whether the algorithm performs well. We expect the clustering result to reveal the natural distribution of the original dataset, or to meet people’s expectations, so clustering validity evaluation criteria are key to cluster analysis. This paper summarizes several commonly used internal and external validity evaluation criteria, compares the effectiveness of the internal criteria, and analyzes their characteristics. It also introduces several commonly used external validity criteria, analyzes them, and puts forward several new external validity indexes. These new indexes reveal the results of the algorithms more effectively, reflect the real distribution of the original dataset, and avoid bias toward particular clusters.
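
The initialization idea in item 1 can be illustrated with a small sketch. The abstract gives no formulas, so everything concrete below is an assumption made for illustration: the "variance" of an object is taken to be its mean squared distance to all other objects, the mean pairwise distance is used as a neighborhood radius, and the function name min_variance_centers and its fallback rule are hypothetical. It is one possible reading of the idea, not the thesis's actual algorithm.

```python
import math
from itertools import combinations


def min_variance_centers(points, k):
    """Pick k initial centers (hypothetical reading of the minimum-variance idea)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    if n == 1:
        return [points[0]]

    # Pairwise Euclidean distances, stored symmetrically.
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = math.dist(points[i], points[j])

    # Assumption: "variance" of an object = mean squared distance to all others.
    variance = [sum(x * x for x in row) / (n - 1) for row in d]
    # Assumption: neighborhood radius = mean distance over all distinct pairs.
    pair_count = n * (n - 1) / 2
    radius = sum(d[i][j] for i, j in combinations(range(n), 2)) / pair_count

    candidates = set(range(n))
    chosen = []
    while len(chosen) < k and candidates:
        # The lowest-variance candidate sits in the densest remaining region.
        best = min(candidates, key=lambda i: (variance[i], i))
        chosen.append(best)
        # Drop the new center and every candidate within the radius around it.
        candidates = {i for i in candidates if d[best][i] > radius}

    # Fallback (also an assumption): if the radius removed too many objects,
    # fill up with the remaining lowest-variance objects.
    if len(chosen) < k:
        rest = sorted((i for i in range(n) if i not in chosen),
                      key=lambda i: (variance[i], i))
        chosen.extend(rest[: k - len(chosen)])
    return [points[i] for i in chosen]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    print(min_variance_centers(data, 2))
```

Because the sketch derives both the variance and the radius from the data, the caller supplies only the points and k, which matches the abstract's claim that no extra parameters are needed for initialization. The returned centers could then seed an ordinary K-means or K-medoids loop.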
Keywords/Search Tags: Clustering algorithm, minimum variance, K-means algorithm, K-medoids algorithm, cluster analysis, clustering analysis effectiveness index