Research On Clustering Algorithm Based On Automatic Detection Of Density Peaks

Posted on: 2021-04-07    Degree: Master    Type: Thesis
Country: China    Candidate: S Q Cui    Full Text: PDF
GTID: 2428330626465639    Subject: Engineering
Abstract/Summary:
With the rapid development of information technology and intelligent industry, the volume of data has grown exponentially. How to extract valuable information from massive data has become a concern in all walks of life. Data mining, as a technical means of obtaining useful information, has received wide attention in recent years. Cluster analysis, an important branch of data mining, has also developed rapidly and is now used in the life sciences, image segmentation, pattern recognition and other fields.

Clustering by fast search and find of density peaks (DPC) is a density clustering algorithm published in Science by Alex Rodriguez et al. in 2014. It is simple and efficient, depends on few parameters, and adapts to non-convex data sets. Although DPC is a substantial improvement over earlier algorithms, it still has some shortcomings:
(1) There is no unified density measurement criterion: the density formula has to be chosen according to the sample set, and the allocation of sample points with equal density is not resolved.
(2) The result is sensitive to the choice of the cutoff distance d_c: a small difference in d_c can seriously affect the sample density estimate.
(3) Defining sample similarity by Euclidean distance alone is too simple and is limited on complex data sets, such as non-spherical ones.
(4) The cluster center points must be selected manually, which is subjective and error-prone on data sets with low discrimination, resulting in poor clustering results.

In response to these problems, this thesis makes the following improvements:
1) To address the inconsistent density measurement criteria, the difficulty of allocating sample points with equal density, and the manual selection of cluster center points, a new E-DPC algorithm is proposed. The algorithm uses the mathematical properties of the Gaussian function to optimize the density measurement formula, resolves by index the assignment of sample points whose densities are equal, and finally uses the hypothesis-testing feature of the SH-ESD algorithm to select the center points automatically. Experimental results on UCI standard data sets and synthetic data sets show that the optimized algorithm achieves a better clustering effect.
2) To address the sensitive setting of the cutoff distance d_c, the overly simple Euclidean similarity measure, and the subjectivity of manually selecting cluster center points, a new KE-DPC algorithm is proposed. The algorithm first combines KNN neighbor information with Euclidean distance to optimize the similarity measurement criterion. It then redefines the local density formula in terms of the K nearest neighbor samples, which avoids setting the sensitive cutoff distance d_c. Finally, it fits a linear regression to the sample distribution on the decision map to obtain the residual set, and then obtains the cluster center points automatically using the residual analysis of ESD anomaly detection, eliminating the subjectivity of human selection. Comparison experiments among the optimized KE-DPC algorithm, K-means, DBSCAN, DPC and other algorithms show that KE-DPC determines the center points more accurately and achieves better results on various evaluation indicators.
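
For orientation, the baseline DPC procedure that the abstract sets out to improve can be sketched roughly as follows. This is a minimal illustration of the original DPC idea with a Gaussian-kernel density, not the E-DPC or KE-DPC algorithms of the thesis, whose exact formulas are not given in the abstract. The function name `dpc` and the parameter `n_centers` are hypothetical; taking the `n_centers` points with the largest rho * delta is a simple stand-in for the manual center selection on the decision map, not the automatic ESD-based selection described above.

```python
import numpy as np

def dpc(points, d_c, n_centers):
    """Baseline density peak clustering sketch (illustrative only)."""
    n = len(points)
    # Pairwise Euclidean distances between all samples.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density rho: Gaussian kernel with cutoff distance d_c.
    # Subtracting 1 removes each point's own contribution (exp(0) = 1).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

    # Visit points from densest to sparsest. Ties in rho are broken by
    # array order here, which is an assumption of this sketch.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; by convention its
            # delta is its largest distance to any other point.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]   # distance to the nearest denser point
        parent[i] = j

    # Decision value gamma = rho * delta; large values suggest centers.
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]

    # Each remaining point inherits the label of its nearest denser point.
    # This assumes the densest point is among the chosen centers, which
    # holds unless gamma values tie exactly.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers, rho, delta

# Usage on two small, well-separated groups of points.
group = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05]], dtype=float)
data = np.vstack([group, group + 10.0])
labels, centers, rho, delta = dpc(data, d_c=0.5, n_centers=2)
print(labels, centers)
```

The sketch makes shortcomings (2) and (4) visible: both `d_c` and `n_centers` have to be supplied by hand, and a different `d_c` changes every density value.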
Keywords/Search Tags: DPC, K-nearest-neighbor, Linear regression, Residual set, ESD anomaly detection