Study On Improvement Of K-means Clustering Algorithm

Posted on: 2019-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Guo
Full Text: PDF
GTID: 2428330563456424
Subject: Public Security Technology
Abstract/Summary:
With the development of information technology, the human ability to collect, store, transmit, and process data has increased rapidly. Every area of human activity, including commerce, science, engineering, medicine, and daily life, has accumulated large amounts of data, and powerful, general-purpose tools are needed to analyze and use that data effectively. Machine learning and data mining meet this need, and both have developed quickly and received extensive attention. Cluster analysis is widely used in machine learning, data mining, pattern recognition, and image processing. Among the many clustering algorithms, K-means has become one of the most widely used because of its simplicity, efficiency, and adaptability. However, the traditional K-means algorithm has two prominent problems: the selection of the initial cluster centers and the determination of the number of clusters.

For the selection of initial cluster centers, a new K-means initialization algorithm based on Principal Component Analysis (PCA) is proposed, drawing on Rahman's Sum Score algorithm. First, PCA reduces the original multidimensional data to one-dimensional data. Second, the one-dimensional data is sorted in ascending order. Third, the sorted data is divided into k subsets. The multidimensional data is then divided into k subsets through the correspondence between the one-dimensional and multidimensional data, and the center of each of the k subsets is computed. Finally, for each subset center, the nearest data point in the original multidimensional data is taken as an initial cluster center. In comparisons with other initialization algorithms on artificial datasets and UCI datasets, the experimental results show that the new algorithm significantly improves clustering quality.

For the determination of the number of clusters, a new method for finding the optimal number of clusters for K-means based on potential stability is proposed. The method exploits the potential stability of the dataset: it runs the clustering from two different sets of initial cluster centers and then checks whether the two clustering results are the same in order to obtain the optimal number of clusters. Compared on the UCI datasets with methods that determine the optimal number of clusters from internal cluster validity indices, the new method obtains the correct number of clusters more accurately.
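The PCA-based initialization steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names `first_pc_scores` and `pca_initial_centers` are hypothetical, the first principal component is approximated with power iteration on the covariance matrix (a swapped-in technique chosen to keep the sketch dependency-free), and the sorted data is assumed to be cut into k nearly equal-sized subsets, which the abstract does not specify.

```python
import math

def first_pc_scores(points, iters=200):
    """Project points onto the first principal component (power iteration)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Arbitrary start vector; assumed not orthogonal to the first component.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # One-dimensional score of each point along the estimated component.
    return [sum(r[j] * v[j] for j in range(d)) for r in centered]

def pca_initial_centers(points, k):
    """Pick k initial centers: sort by 1-D score, split, take nearest-to-mean points."""
    n = len(points)
    if n < 2 or not 1 <= k <= n:
        raise ValueError("need at least 2 points and 1 <= k <= number of points")
    scores = first_pc_scores(points)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending 1-D order
    centers = []
    for g in range(k):
        # Assumption: the sorted data is cut into k nearly equal-sized runs.
        subset = [points[i] for i in order[g * n // k:(g + 1) * n // k]]
        centroid = [sum(col) / len(subset) for col in zip(*subset)]
        # Nearest original data point to this subset's center.
        centers.append(min(
            points,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, centroid))))
    return centers
```

Calling `pca_initial_centers(points, k)` on a list of coordinate tuples returns k of the original points, which could then be passed to any K-means routine as its starting centers. Returning actual data points rather than subset means follows the final step described in the abstract.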
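The stability check for the number of clusters can be illustrated with a small sketch, under stated assumptions. The abstract only says that clustering is run from two different sets of initial centers and the results are compared; the candidate range of k, the two initialization strategies (`init_a`, `init_b`), the helper names, and the choice to return every k on which the two runs agree are all hypothetical here, and the thesis's actual selection rule may differ.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, centers, iters=100):
    """Plain Lloyd iterations from the given initial centers; returns labels."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def same_partition(a, b):
    """True if two label lists group the points identically, ignoring label names."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))

def agreeing_cluster_counts(points, k_values, init_a, init_b):
    """Return the candidate k values for which both initializations agree."""
    return [k for k in k_values
            if same_partition(kmeans_labels(points, init_a(points, k)),
                              kmeans_labels(points, init_b(points, k)))]
```

Here `init_a` and `init_b` are any two functions that take `(points, k)` and return k starting centers, for example one taking the first k points and one taking the last k. `same_partition` compares groupings rather than raw labels, since two runs may number the same clusters differently.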
Keywords/Search Tags: K-means clustering algorithm, principal component analysis, initial clustering centers, potential stability, number of clusters