
Improved K-Means Algorithm Based On K Nearest Neighbor And Density Peak Algorithm

Posted on: 2022-08-29    Degree: Master    Type: Thesis
Country: China    Candidate: W W Wang    Full Text: PDF
GTID: 2518306602470644    Subject: Computer technology
Abstract/Summary:
The rapid development of Internet technology has brought great convenience to people, whose daily lives have become so closely connected with the Internet that the two can no longer be separated. People use the Internet to upload and download all kinds of data, such as text, tables, images, audio, and video, and over time this has produced exponential growth in data volume. Data mining technology, which can extract high-value information from large, messy, low-value-density data sets, has therefore attracted great attention from both academia and industry. Within data mining, the K-means algorithm, with its outstanding advantages, has become the mainstay of the core area of clustering algorithms.

The main steps of the K-means algorithm are as follows: randomly select K initial center points from the data set; compute the similarity between data objects; partition the data into K clusters; take the mean of each cluster and update the center points; re-partition the data set; and repeat this cycle until the set of center points no longer changes. The result is K clusters with high intra-cluster similarity and low inter-cluster similarity, which completes the clustering of the entire data set.

Analysis of the algorithm reveals the following obvious shortcomings: measuring the similarity of data with Euclidean distance alone is inaccurate; the choice of the cluster number K is subjective and empirical; and the selection of the initial cluster centers is random. In addition, experiments show that the original K-means algorithm is inefficient when clustering high-dimensional data. This thesis addresses these four defects of the K-means algorithm with the following corresponding improvements:

(1) Shannon entropy is used to compute the similarity between data points more accurately. Different data sets have different characteristic attributes, and these characteristic attributes have a greater influence on the similarity computation between data points. However, when Euclidean distance is used to compute similarity, no distinction is made between the characteristic and non-characteristic attributes of a data set. This thesis introduces Shannon entropy to assign higher weights to the characteristic attributes of the data set, so that the similarity between data points is measured more accurately.

(2) A peak method based on the Calinski-Harabasz (CH) index is used to determine the optimal cluster number k of the data set. According to the established empirical rule, the optimal number of clusters lies in the range [2, √n], where n is the total number of data objects in the data set. Within this range, different values of k are set for the data set to be clustered; for each k, the original K-means algorithm is run 10 times and the largest CH value obtained is recorded as the CH value for that k. Finally, a line chart is drawn with k as the abscissa and the corresponding CH value as the ordinate, and the k at the peak point is selected as the optimal cluster number. Since the original K-means algorithm degrades greatly on high-dimensional data sets, dimensionality reduction is performed in advance when experimenting with high-dimensional data.

(3) An improved density peak clustering (DPC) algorithm is used to find the best cluster centers of the data set. The DPC algorithm can reliably locate the cluster centers of the various clusters in many kinds of complex data sets. However, in the original algorithm the local-density cutoff parameter dc is usually set empirically, without fully accounting for the data distribution around each data point in different data sets, and the cluster centers are selected manually. Addressing these two points, this thesis integrates the idea of the K-nearest-neighbor algorithm into the original DPC algorithm: the distribution of each data point and its K nearest neighbors is used to improve the definition of the parameter dc, and, combined with the optimal cluster number obtained by the peak method, the cluster center of each cluster is selected automatically.

Finally, the thesis summarizes the research content of the whole work and further discusses the problems that remain in the research and the aspects that still need improvement.
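The baseline K-means procedure summarized above (random centers, assign to nearest center, update centers to the cluster means, repeat until the centers stabilize) can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means as described in the abstract: random initial centers,
    assign each point to its nearest center, update each center to the mean
    of its cluster, and repeat until the centers no longer change."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update each center to its cluster mean; keep an empty cluster's
        # old center unchanged.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On well-separated data this converges in a handful of iterations; the random initialization is exactly the weakness that improvements (2) and (3) target.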
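The entropy weighting of improvement (1) could look roughly as follows. The abstract does not state the exact weighting formula, so this sketch uses the common "entropy weight method", in which attributes with less uniform (lower-entropy) value distributions are treated as more characteristic and receive higher weights:

```python
import numpy as np

def entropy_weights(X, bins=10):
    """Derive a per-attribute weight from each attribute's Shannon entropy.

    A histogram approximates each attribute's value distribution; entropy is
    normalized by log(bins) so it lies in [0, 1]. Lower entropy is taken to
    mean a more informative (characteristic) attribute, so the weight is
    proportional to (1 - entropy). This is an assumption standing in for the
    thesis's own weighting scheme.
    """
    n, d = X.shape
    entropies = np.empty(d)
    for j in range(d):
        hist, _ = np.histogram(X[:, j], bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        entropies[j] = -np.sum(p * np.log(p)) / np.log(bins)
    weights = 1.0 - entropies
    return weights / weights.sum()

def weighted_euclidean(a, b, w):
    """Euclidean distance with per-attribute weights, replacing the plain
    Euclidean distance criticized in the abstract."""
    return np.sqrt(np.sum(w * (a - b) ** 2))
```

With uniform weights this reduces to ordinary (scaled) Euclidean distance; the point of the weighting is that characteristic attributes contribute more to the similarity measure.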
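The k-selection procedure of improvement (2) can be sketched as follows. For convenience this sketch uses scikit-learn's `KMeans` (whose `n_init` parameter corresponds to the abstract's "run 10 times and keep the best") and its `calinski_harabasz_score`; the thesis presumably uses its own implementation, and this sketch simply returns the peak k rather than drawing the line chart:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def best_k_by_ch_peak(X, n_restarts=10, random_state=0):
    """Select the cluster number k whose CH index peaks.

    For each k in [2, floor(sqrt(n))], K-means is restarted n_restarts
    times and the best run's labels are scored with the CH index; the k
    with the largest CH value, i.e. the peak of the k-vs-CH curve, is
    returned together with the full score table.
    """
    n = X.shape[0]
    k_max = max(2, int(np.sqrt(n)))
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=n_restarts,
                        random_state=random_state).fit_predict(X)
        scores[k] = calinski_harabasz_score(X, labels)
    return max(scores, key=scores.get), scores
```

The returned score table is exactly the data one would plot as the k-vs-CH line chart described in the abstract.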
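The KNN-based refinement of DPC in improvement (3) might be sketched as below. The abstract does not give the improved definition of dc, so as a stand-in this sketch sets dc from the mean K-nearest-neighbor distance over the data set, then selects centers automatically (rather than manually) as the points with the largest density-times-separation product, with the number of centers supplied by the peak method of improvement (2):

```python
import numpy as np

def dpc_centers(X, n_centers, k_neighbors=5):
    """Pick cluster centers with a density-peak (DPC) style rule.

    dc is derived from each point's K nearest neighbors (here: the mean
    K-NN distance over all points), a hypothetical stand-in for the
    thesis's improved definition. rho is a Gaussian-kernel local density,
    delta is the distance to the nearest higher-density point, and the
    n_centers points with the largest rho * delta are returned as centers.
    """
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    n = len(X)
    # Mean distance to the K nearest neighbors, excluding the point itself.
    knn_d = np.sort(d, axis=1)[:, 1:k_neighbors + 1]
    dc = knn_d.mean()
    # Gaussian-kernel local density (subtract the self term exp(0) = 1).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest point of strictly higher density.
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = d[i, order[:rank]].min()
    gamma = rho * delta
    return np.argsort(-gamma)[:n_centers]
```

Because only density peaks have both large rho and large delta, the top gamma values land on one point per cluster, which is what makes the center selection automatic.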
Keywords/Search Tags: K-means algorithm, Shannon entropy, improved DPC algorithm