Research On Improvement Of K-means Clustering Algorithm

Posted on: 2017-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: J L Song
Full Text: PDF
GTID: 2308330485964129
Subject: Computer application technology
Abstract/Summary:
With the spread of computer networks, our lives and work involve information and data ever more frequently, and the volume of data that people produce and use keeps growing: we have entered the era of big data. Although people have access to vast data resources, they usually need only a small part of them, or the information hidden within them. How to obtain the required information from massive data resources quickly and efficiently, and how to discover the relationships and rules among the data, is an urgent research topic. Data mining is an interdisciplinary technology driven by this demand. It aims to discover latent, valid and valuable information or knowledge in large amounts of data, so that the information hidden in the data can be better understood and applied in support of scientific decision-making and policy formulation. Cluster analysis is a very important technique in data mining and is widely used in fields such as image segmentation, e-commerce, market analysis, biology, geography and document classification. Its basic principle is to partition a data set into a number of clusters without prior knowledge, so that samples in the same cluster have attributes that are as similar as possible and samples in different clusters have attributes that are as different as possible. Among clustering algorithms, partition-based algorithms are some of the most widely used because their basic principle is simple, they are easy to implement, and they scale well to large data sets. The K-means clustering algorithm is the most representative of them.
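For orientation, the standard K-means iteration referred to above can be sketched as follows. This is a minimal illustration of the textbook algorithm with randomly chosen initial centers, not the thesis's improved method; the function name `kmeans` and its parameters are assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Textbook K-means with random initial centers (illustrative sketch only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly selected initial cluster centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each sample to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels
```

Here `points` is a list of equal-length numeric tuples. Because the initial centers are drawn at random, different seeds can in general lead to different local optima, which is the instability the abstract goes on to discuss.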
However, the traditional K-means algorithm has several shortcomings: the number of clusters k must be specified in advance from experience, and the k initial cluster centers are selected at random; the clustering result depends on both the initial cluster centers and the value of k; and the algorithm is sensitive to outliers and noisy samples. To address these shortcomings, this thesis proposes one improved K-means algorithm that optimizes the selection of the initial cluster centers and another that determines the optimal number of clusters k. Experiments verify the validity of both improved algorithms. The two improvements are as follows:

1. The clustering result of K-means is affected by the initial cluster centers and by abnormal data, which makes the result unstable and prone to converging to a local optimum. An improved algorithm is proposed that selects k samples in dense regions as the initial cluster centers. It introduces a parameter named m-dist to represent the density of each sample in the data set, and selects as initial cluster centers k samples that lie in high-density regions and are relatively dispersed from one another. The algorithm effectively avoids selecting outliers as initial cluster centers, which reduces the number of iterations and improves the accuracy of the clustering result.

2. The classic K-means algorithm requires the number of clusters k to be given from experience; this choice is subjective and may lead to an erroneous clustering result. This thesis proposes an improved method for determining the optimal number of clusters. First, a candidate set U of initial cluster centers is built from samples selected in high-density regions. Second, the best number of clusters is searched for according to the proposed clustering validity index, AIBWP. Finally, the number of clusters at which AIBWP reaches its optimal value is taken as the optimal number of clusters.
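The idea of choosing dense, mutually dispersed samples as initial centers can be illustrated with a small sketch. The abstract does not define m-dist, so the density score below (mean distance to the m nearest neighbours) is only a hypothetical stand-in, and the farthest-first selection rule, the `keep` fraction and all function names are likewise assumptions made for this example rather than the thesis's actual procedure.

```python
import math

def knn_mean_dist(points, m):
    """Mean distance from each sample to its m nearest neighbours (smaller = denser).
    Hypothetical stand-in for m-dist, whose definition the abstract does not give."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:m]) / m)
    return scores

def dense_dispersed_seeds(points, k, m=2, keep=0.8):
    """Pick k initial centers that are dense and far apart (assumed selection rule)."""
    scores = knn_mean_dist(points, m)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    # Candidate set: the densest `keep` fraction, so isolated outliers are dropped.
    n_keep = max(k, int(keep * len(points)))
    candidates = [points[i] for i in order[:n_keep]]
    seeds = [candidates[0]]  # start from the densest sample
    while len(seeds) < k:
        # Add the candidate whose nearest already-chosen seed is farthest away.
        seeds.append(max(candidates,
                         key=lambda c: min(math.dist(c, s) for s in seeds)))
    return seeds
```

Seeds chosen this way could be passed to a K-means routine in place of random initial centers. Searching for the best k would additionally need the AIBWP index, which the abstract names but does not define, so it is not sketched here.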
Keywords/Search Tags: Clustering Analysis, K-means Algorithm, Initial Clustering Centers, Optimal Clustering Number, Density of Samples