
Improvements And Implementation Of K-means Clustering Algorithm

Posted on: 2016-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Q R Dong
GTID: 2298330467997356
Subject: Data mining
Abstract/Summary:
Cluster analysis is an important research field in data mining. It is an unsupervised learning process that automatically identifies dense and sparse regions of the object space. In the age of big data, cluster analysis has become one of the research hotspots of machine learning and data mining.

K-means is a classical partition-based algorithm in cluster analysis. It is simple and adaptable, and can be applied to a variety of data types. Because it scales well, it can also process large data sets efficiently. K-means therefore remains a focus of clustering research. Its main problem is that the number of clusters is not known in advance, which directly affects the quality of the clustering. In addition, the clustering result depends heavily on the initial center points, since different initial centers strongly affect the stability of the result. This paper introduces improvements aimed at these shortcomings of the k-means algorithm.

First, this paper introduces the background of cluster analysis, briefly reviews its development and current state, presents the properties that a good clustering method should have, and lists some typical clustering algorithms.

Second, this paper gives a more comprehensive introduction to the classical k-means algorithm, including its implementation, advantages, and disadvantages. By comparison with currently popular clustering methods, it points out that k-means has clear deficiencies in determining the optimal number of clusters.
Therefore, it is natural to propose an improved k-means algorithm that better determines the optimal number of clusters and so improves the algorithm's applicability and usability.

Then, the selection of the initial center points is improved. An analysis of several popular improvement methods shows that they all still rely on random selection, which cannot avoid unstable clustering results. This paper presents a different scheme, in which the initial center points are selected based on the features of the data. Experiments show that initial centers selected in this way effectively reduce the number of iterations of the clustering algorithm and improve its efficiency. At the same time, both the clustering results and the number of iterations are stable.

Finally, following the principle of dissimilarity measurement, this paper proposes an improved clustering algorithm based on weight values. By distinguishing between the dimensions of the data, the algorithm lets different dimensions affect the clustering result to different degrees. The experimental results show that clustering accuracy improves to some extent. This work is also combined with the method for determining the optimal number of clusters and the optimization of the initial centers, so that the algorithm runs automatically from determining the number of clusters to producing the final clustering result, which improves its practicality. Because the data themselves are emphasized when the number of clusters is determined, the efficiency of the k-means algorithm benefits. Experiments on standard data sets demonstrate that the improved algorithm enhances the accuracy of the clustering results and shows good stability.

The innovation of this algorithm lies in its data processing.
Through analysis of the data, one can determine the optimal number of clusters and the initial center points; at the same time, by adjusting the weight values, one can distinguish the importance of the data in different dimensions, which avoids the measurement of similarity. Theoretical analysis and experimental data show that better clustering results can be obtained in a single clustering run.
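
The two ideas described above, initial centers chosen from the data rather than at random and a distance measure that weights each dimension, can be illustrated with a minimal Python sketch. The abstract does not state the thesis's actual seeding rule or how the weights are computed, so this sketch makes its own assumptions: it substitutes a simple farthest-point seeding for the data-based initialization, takes the weights as a hand-supplied list, and expects the number of clusters `k` to be given. The names `weighted_dist`, `farthest_point_init`, and `weighted_kmeans` are hypothetical and do not come from the thesis.

```python
import math


def weighted_dist(a, b, w):
    """Weighted Euclidean distance; w[j] scales the contribution of dimension j."""
    return math.sqrt(sum(wj * (x - y) ** 2 for x, y, wj in zip(a, b, w)))


def farthest_point_init(points, k, w):
    """Deterministic seeding (an assumed stand-in, not the thesis's rule).

    Start from the data point nearest the overall mean, then repeatedly add
    the point that is farthest from every center chosen so far.
    Assumes k does not exceed the number of distinct points.
    """
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    centers = [min(points, key=lambda p: weighted_dist(p, mean, w))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(weighted_dist(p, c, w) for c in centers))
        )
    return [list(c) for c in centers]


def weighted_kmeans(points, k, w, max_iter=100):
    """Lloyd-style k-means using the weighted distance and deterministic seeds.

    Returns (centers, labels, iterations). With positive per-dimension weights
    the best center for a cluster is still the plain mean of its members.
    """
    centers = farthest_point_init(points, k, w)
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda i: weighted_dist(p, centers[i], w))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so we have converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels, iteration


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels, iterations = weighted_kmeans(data, k=2, w=[1.0, 1.0])
    print(centers, labels, iterations)
```

Because nothing in the seeding is random, repeated runs on the same data return the same centers, labels, and iteration count, which is the kind of stability the abstract attributes to data-based initialization. Changing `w` (for example, `[1.0, 0.1]`) reduces the influence of the second dimension on the assignments.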
Keywords/Search Tags: clustering analysis, k-means, number of clusters, initial cluster centers, weight