Font Size: a A A

Research On Path-based Of Clustering By Fast Search And Find Of Density Peak

Posted on:2020-10-15Degree:MasterType:Thesis
Country:ChinaCandidate:G L YanFull Text:PDF
GTID:2428330599456768Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Clustering is an unsupervised classification technology that divides data sets into different clusters based on similarities between data elements.By analyzing the clustering results and observing the distribution rules of different clusters,we can find the hidden rules behind the data.With the development of social information,cluster analysis plays an increasingly important role in social life and has been widely used in many fields.However,with the rapid growth and diversification of data scales,clustering algorithm research is facing a complicated issue.And to propose a clustering algorithm with stronger adaptability has become an urgent problem to be solved.Density based clustering algorithms can be applied to cluster data in arbitrary shapes,and has strong anti-noise ability,which has become the research direction of clustering algorithms in recent years.Clustering by fast search and find of density peak(CFSFDP)is a clustering algorithm based on different density proposed by Alex Rodriguez.The algorithm is recognized by researchers as it is simple,and has good clustering effect and high time efficiency.The CFSFDP has achieved good clustering results on some data sets.But,this algorithm still has some shortcomings:(1)The CFSFDP is based on the condition that the central node is surrounded by nodes whose density are smaller than it;the central node is farther away from the node with a higher density.In this hypothesis,there is only one density peak in the same cluster.But,when there are multiple density peaks in the cluster,the algorithm can easily divide one cluster into multiple clusters.Thus,this algorithm cannot be applied to cluster the data sets with multiple density peaks because of this multicenter problem.(2)The truncation distance is the only parameter in CFSFDP.The length of the truncation distance determines the density of the nodes.The calculation method of the truncation distance is provided in the algorithm.However,experiments prove that this calculation method is not suitable for the data set with complex shapes.(3)In order to identify the noise node,the CFSFDP defines a boundary region,and other cluster nodes are also included in the boundary region.The algorithm sets the threshold for filtering noise nodes for each cluster to find the point where the density of its boundary region is the highest.The point in the cluster where the density is greater than the threshold is considered to be the cluster node,and the node whose density is less than the critical threshold is considered to be the noise node.This noise node identification accuracy is not high.Considering the shortcomings of CFSFDP,a path-based of Clustering by fast search and find of density peak(PB-CFSFDP)and a PB-CFSFDP algorithm for automatically identifying noise node are proposed.The PB-CFSFDP solves the shortcomings of the CFSFDP.The PB-CFSFDP algorithm for automatically identifying noise node realized the function of automatic noise recognition based on PB-CFSFDP.In order to better evaluate the clustering effect,this paper proposes a set of clustering algorithm evaluation indicators based on path-based distance,which can be well applied to different types of data sets.The research content of this paper is as follows:1.Proposing a new cluster evaluation indexThe distance between the nodes is a straight-line distance,making these three traditional indicators: Compactness,Separation and Davies-Bouldin Index,not applicable to non-convex data sets.In this paper,the author proposes three indicators: Path-based Compactness(PBCP),Path-based Separation(PBSP)and Path-based Davies-Bouldin Index(PBDBI),which can be well applied to convex and non-convex data sets.Based on PBDBI,a new clustering result evaluation index PBCI is proposed.This index comprehensively considers the Compactness,Separation and noise identification of clusters and it can be applied to evaluate the clustering results of different noiseidentifying clustering ways.2.Proposing the PB-CFSFDP algorithm(1)A similarity calculation method based on path distance is proposed,and the straight-line distance is replaced by the path-based distance between the nodes.The distance between the density peaks is shortened based on the path-based distance,so that there is at most one central node in the same cluster,which solves the problem that the CFSFDP algorithm mistakes a multi-center cluster into multiple clusters.(2)A new truncation distance calculation method is proposed.The truncated distance value is equal to the Lth long side value in the minimum spanning tree of the data set.Experiments explains that L shows a good clustering effect in most situations,and the truncation distance is highly robust.(3)A noise node recognition method based on decision diagram is proposed.The decision diagram has the node density as the abscissa and the neighboring k-distance close to the node as the ordinate.Then,we drew the distribution of all nodes in the decision diagram,and selected the nodes with small density and neighboring k-distance close to the node as noise nodes.Experiments show that this method can accurately identify noise nodes.3.Proposing the APB-CFSFDP algorithmThe PB-CFSFDP achieves good clustering effect for date set of different shapes.But,the noise nodes need to be manually selected in this algorithm,which reduces its calculating effectiveness.To solve this problem,The APB-CFSFDP is proposed based on PBCI and PB-CFSFDP.In this article,we compared PB-CFSFDP,APB-CFSFDP with K-means,DBSCAN,CFSFDP.Experiments show that these two algorithms have better data applicability,parameter robust,and accuracy of noise identification.Also,APB-CFSFDP has realized automatic noise identification based on PB-CFSFDP,which has improved the clustering effect and calculating efficiency.
Keywords/Search Tags:Data Mining, Cluster Analysis, Clustering by Fast Search and Find of Density Peak, Path-based Distance
PDF Full Text Request
Related items