Research On Initial Cluster Centers Choice Algorithm And Clustering For Imbalanced Data

Posted on:2016-07-03

Degree:Master

Type:Thesis

Country:China

Candidate:P P Wu

GTID:2308330482450645

Subject:Software engineering

Abstract/Summary:

With the continuous development of Internet technology, and of information technology in particular, large amounts of data in many forms have been produced on the Internet, much of it imbalanced. A data set is imbalanced when some clusters contain far fewer samples than others in the same data set. A large disparity in sample counts is often accompanied by a large disparity in density between clusters. Although k-means is the most widely used clustering algorithm, it occasionally chooses an isolated point as an initial cluster center, which badly degrades its results. Choosing the initial cluster centers appropriately is therefore an urgent problem. When k-means is applied to imbalanced data, accuracy is high on clusters with many samples but low on clusters with few. In imbalanced data sets, the small clusters often carry more information, so identifying their samples accurately is important.

Focusing on k-means-type algorithms, this thesis studies methods for choosing the initial cluster centers and for computing cluster similarity in imbalanced data sets. The following results are obtained.

(1) Combining the local scale parameter from spectral clustering with the max-min distance algorithm, we propose an algorithm that selects initial cluster centers based on both sparsity and distance. To obtain the initial centers, it considers the distribution of the samples around each candidate center as well as the distances between the centers. We then apply the proposed algorithm to select the initial cluster centers for the k-means and fuzzy k-means clustering algorithms. Experimental results on both UCI data sets and real data sets show that the proposed algorithm is effective and feasible in real applications.

(2) Taking the average-linkage measure of cluster similarity as a starting point, we propose a method for computing cluster similarity and a cluster merging algorithm based on the cluster similarity matrix. When computing the similarity between clusters, the method takes the sparsity of the samples into account before computing the average similarity between all the samples. We then combine the proposed algorithm with the modified initial-center selection for the k-means and fuzzy k-means clustering algorithms. Experimental results on imbalanced data sets show that the proposed algorithm is effective and feasible in real applications.

In summary, we have studied initial cluster center selection for the k-means algorithm and the clustering of imbalanced data sets, and we propose the Max_Min_SD and M_C_SA algorithms. The experimental results show that the algorithms are effective. Much remains to be improved and discussed, for example why k-means depends more heavily on the initial cluster centers than fuzzy k-means does. This work is only a beginning, and more in-depth study is needed.
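Result (1) pairs max-min distance selection with a measure of sparsity. The abstract does not give the Max_Min_SD formulas, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes that the local scale of a sample is its distance to its k-th nearest neighbour (as in self-tuning spectral clustering), and that samples with the largest local scales are simply excluded as candidate centers. The function names `local_scale` and `max_min_seeds` and the parameters `k` and `sparse_quantile` are hypothetical.

```python
import numpy as np


def local_scale(X, k=7):
    """Return (sigma, d): each sample's distance to its k-th nearest
    neighbour (small sigma = dense region) and the full distance matrix."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Column 0 of each sorted row is the sample's zero distance to itself,
    # so column k is the k-th nearest other sample. Requires k < len(X).
    sigma = np.sort(d, axis=1)[:, k]
    return sigma, d


def max_min_seeds(X, n_clusters, k=7, sparse_quantile=0.9):
    """Pick initial centers by max-min distance among non-sparse samples."""
    sigma, d = local_scale(X, k)
    # Assumed rule (not from the thesis): samples whose local scale lies
    # above the given quantile are treated as isolated and never selected.
    eligible = sigma <= np.quantile(sigma, sparse_quantile)

    # First center: the eligible sample in the densest neighbourhood.
    first = int(np.argmin(np.where(eligible, sigma, np.inf)))
    seeds = [first]
    # min_dist[i] = distance from sample i to its closest chosen center.
    min_dist = d[first].copy()

    for _ in range(n_clusters - 1):
        # Max-min step: take the eligible sample farthest from all centers.
        nxt = int(np.argmax(np.where(eligible, min_dist, -np.inf)))
        seeds.append(nxt)
        min_dist = np.minimum(min_dist, d[nxt])
    return X[seeds]
```

The returned array can be passed as the initial centers of a k-means run, for example through the `init` argument of scikit-learn's `KMeans`. On data with one far-away isolated point, plain max-min selection would pick that point as its second center because it is farthest from everything else; the sparsity filter in this sketch is what prevents that.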

Keywords/Search Tags:

imbalanced data, initial center, max-min distance algorithm, k-means algorithm, cluster merging

Related items

1	Design And Implementation Of Initial Cluster Center Selection Algorithm For Categorical Matrix-object Data
2	Research On The Selection Of Initial Cluster Centers In K-means Algorithm
3	Improved K-means Algorithm Based On Optimizing Initial Cluster Centers
4	Improvement And Application Of K-means Algorithm
5	Research And Application Of K-means Clustering Algorithm
6	Design And Implementation Of Initial Cluster Center Selection Algorithm Based On Coupling Attribute
7	Research On Problems Related To The Initial Center Selection In K-means Clustering Algorithm
8	Study On Problems To Select Initial Cluster Centers Of The K-means Algorithm
9	The Research Of The K-means Clustering Algorithm Based On Nearest Neighbors
10	The Research Of Multi-clusters IB Algorithm For Imbalanced Data Set