
Research On Initial Cluster Centers Choice Algorithm And Clustering For Imbalanced Data

Posted on: 2016-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: P P Wu
GTID: 2308330482450645
Subject: Software engineering
Abstract/Summary:
With the continuous development of Internet technology, and especially the rapid development of information technology, a large amount of data in different forms has been produced on the Internet, in particular imbalanced data. Imbalanced data means that some clusters in a data set contain far fewer samples than the others. A great disparity in the number of samples is often accompanied by a great disparity in density between clusters. Although k-means is the most widely used clustering algorithm, it occasionally chooses an isolated point as an initial cluster center, which badly affects the execution of the algorithm. How to choose the initial cluster centers appropriately is therefore an urgent problem. When the k-means algorithm is applied to imbalanced data, accuracy is high for the clusters with many samples but low for the clusters with few samples. In imbalanced data sets, the clusters with few samples often carry more information, so finding those samples accurately is of great significance.

Focusing on k-means-type algorithms, this thesis studies methods for choosing the initial cluster centers and methods for computing the similarity between clusters in imbalanced data sets. The following results are obtained.

(1) Combining the local scale parameter used in spectral clustering with the max-min distance algorithm, we propose an algorithm for selecting initial cluster centers that is based on both sparsity and distance. To obtain the initial centers, the algorithm considers the distribution of the samples around each candidate center as well as the distances between the centers. We then apply the proposed algorithm to the selection of initial cluster centers for the k-means and fuzzy k-means clustering algorithms.
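The idea in result (1) can be illustrated with a minimal Python sketch. It is not the thesis's own algorithm: the abstract does not say how sparsity and distance are combined, so the quantile filter, the choice of the first center, and the names `local_scale`, `max_min_centers`, `k`, and `scale_quantile` are all assumptions made for illustration. The local scale is taken here as the distance to the k-th nearest neighbour, as in self-tuning spectral clustering.

```python
import math

def local_scale(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (a sparsity measure)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def max_min_centers(points, n_centers, k=2, scale_quantile=0.8):
    """Return indices of initial centers: max-min distance over non-sparse points."""
    scales = [local_scale(points, i, k) for i in range(len(points))]
    # Assumption: points with the largest local scale are treated as isolated and skipped.
    cutoff = sorted(scales)[int(scale_quantile * (len(points) - 1))]
    candidates = [i for i, s in enumerate(scales) if s <= cutoff]
    if n_centers > len(candidates):
        raise ValueError("not enough non-sparse points for the requested centers")
    # Assumption: start from the densest candidate (smallest local scale).
    centers = [min(candidates, key=lambda i: scales[i])]
    while len(centers) < n_centers:
        # Max-min step: pick the candidate farthest from its nearest chosen center.
        best = max(
            (i for i in candidates if i not in centers),
            key=lambda i: min(math.dist(points[i], points[c]) for c in centers),
        )
        centers.append(best)
    return centers

# Hypothetical data: a cluster of four, a cluster of three, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
centers = max_min_centers(points, n_centers=2)
print([points[i] for i in centers])
```

In this toy data set, plain max-min selection would pick the isolated point (50, 50) as the second center because it is farthest from everything else. Filtering by local scale first keeps it out of the candidate set, which is the failure mode the abstract describes.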
Experimental results on both UCI data sets and real data sets show that the proposed initial-center selection algorithm is effective and feasible in real applications.

(2) Taking average linkage as the starting point for measuring cluster similarity, we propose a method for computing cluster similarity and a cluster merging algorithm based on the cluster similarity matrix. When computing the similarity between clusters, we take the sparsity of the samples into account before computing the average similarity between all the samples. We then combine the proposed merging algorithm with the modified selection of initial cluster centers for the k-means and fuzzy k-means clustering algorithms. Experimental results on imbalanced data sets show that the proposed algorithm is effective and feasible in real applications.

In summary, we have studied the selection of initial cluster centers for the k-means algorithm and the clustering of imbalanced data sets, and we propose the Max_Min_SD and M_C_SA algorithms. The experimental results show that the algorithms are effective. Much remains to be improved and discussed, however; one open question is why the k-means algorithm depends more heavily on the initial cluster centers than the fuzzy k-means algorithm does. Our work is only a beginning, and further in-depth work is needed.
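The merging step in result (2) can be sketched in the same hedged way. The code below uses plain average linkage, that is, the mean Euclidean distance over all cross-cluster sample pairs, where a smaller value means a higher similarity. The sparsity weighting mentioned in the abstract is not specified there and is omitted. The names `average_linkage`, `merge_clusters`, and `target` are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(a, b):
    """Mean Euclidean distance over all cross-cluster sample pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def merge_clusters(clusters, target):
    """Repeatedly merge the two most similar clusters until `target` remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        # Scan every cluster pair; the smallest average distance marks the closest pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid after the deletion
    return clusters

# Hypothetical over-segmented result: one large cluster split in two, plus a small cluster.
parts = [[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(10, 10), (10, 11)]]
merged = merge_clusters(parts, target=2)
print(merged)
```

One plausible use, assumed here rather than stated in the abstract, is to run k-means with more centers than the expected number of clusters and then merge the fragments, so that a small cluster is not absorbed into a large one.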
Keywords/Search Tags: imbalanced data, initial center, max-min distance algorithm, k-means algorithm, cluster merging