| Science and technology are advancing rapidly. Modern information technologies such as the Internet, cloud computing, and big data have profoundly changed how people think, work, live, and learn, and have opened up new prospects for world development. The rise of the Internet and the continuous accumulation of data have driven society into the era of big data. Data plays an important role in all aspects of daily life, the volume of data in every industry keeps growing, and data has become an important factor in today's social development. How to process ever-larger datasets effectively has become a difficult problem in data mining. As an unsupervised learning method, clustering is an important data mining tool and an active research topic. DPC (Density Peak Clustering) is a novel and effective density-based clustering algorithm published in Science in 2014. Compared with other clustering algorithms, DPC has unique advantages in handling clusters of different sizes and densities, but it still has some disadvantages: (1) During clustering, a person must select the cluster center points in the decision graph; on some datasets the cluster centers are not clearly distinguishable, which may lead to wrong or missed selections. (2) DPC has difficulty processing sample points in low-density regions effectively, which also prevents it from identifying abnormal points. (3) The time complexity of DPC is O(n²), so it spends a great deal of time on large datasets and is therefore not widely used in the era of big data. This paper proposes an improvement for each of these problems: (1) Because cluster centers in the DPC decision graph are unclear on some datasets, this paper proposes Gravitation-based Density Peaks Clustering (GDPC). The new
algorithm uses the reciprocal of gravity as a parameter to replace the δ parameter in the DPC algorithm. A comparison of the two algorithms shows that cluster center points are easier to distinguish in the decision graph generated by GDPC than in the one generated by DPC. (2) To address the shortcomings of GDPC in identifying abnormal points, this paper proposes a semi-supervised learning algorithm combining KNN and GDPC. The new algorithm uses KNN to classify the unrecognized points in low-density regions, so that GDPC can effectively identify abnormal points. (3) Because GDPC takes too long on large datasets, this paper proposes the k-GDPC algorithm, which combines k-Means and GDPC. k-GDPC adopts a partition-then-merge strategy to quickly discover clusters of different sizes and densities in a spatial database, and it reduces time consumption because much less data has to be processed during clustering. The time complexity of k-GDPC is linear in the amount of data, so it can replace GDPC when dealing with large datasets. |
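
For readers unfamiliar with the baseline, the two quantities plotted in a DPC decision graph can be sketched as follows. This is a minimal illustration of standard DPC only, not the GDPC, KNN-based, or k-GDPC variants, whose exact formulas are not given in this abstract. The function names `dpc_rho_delta` and `pick_centers`, the cutoff distance `d_c`, the index-based tie-break between points of equal density, and the automatic top-`k` ranking by ρ·δ (a stand-in for manual selection in the decision graph) are all assumptions made for the example.

```python
import math


def dpc_rho_delta(points, d_c):
    """Return (rho, delta) for each point, as used in a DPC decision graph."""
    n = len(points)
    # All pairwise Euclidean distances: this n x n table is the O(n^2) cost.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Cutoff kernel: rho_i counts the other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    delta = []
    for i in range(n):
        # Points considered "denser" than i. Assumption: ties in rho are
        # broken by index so that equal-density points get a strict order.
        higher = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        # delta_i is the distance to the nearest denser point; the densest
        # point has none, so it takes its largest distance to any point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Hypothetical helper: indices of the k largest rho * delta values."""
    gamma = [r * d for r, d in zip(rho, delta)]
    return sorted(range(len(gamma)), key=gamma.__getitem__, reverse=True)[:k]
```

Cluster centers are the points with both high ρ and high δ. On datasets where no points stand out this way, the selection becomes ambiguous, which is the first problem listed above.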