Font Size: a A A

Research On Density Peak Clustering Algorithm By Automatically Determining Clusters

Posted on:2021-01-20Degree:MasterType:Thesis
Country:ChinaCandidate:Z C WangFull Text:PDF
GTID:2518306050464554Subject:Master of Engineering
Abstract/Summary:PDF Full Text Request
The Density Peaks Clustering(DPC)published in“Science”in 2014 can automatically determine the number of clusters with an attempt of using density and peak,which can be used for dealing with data sets with various shapes.DPC has become a popular clustering algorithm and attracted a lot of interests of researchers due to its simple principle,remarkable efficiency,few regulation parameters.The main idea of this algorithm can be briefly described as follows.The cluster centers with big density and offset distance are determined by a so called decision graph first.And then clusters are formed through assigning data points to their nearest cluster center(with higher density).The key idea of the algorithm is based on two core intuitive observations.(1)Cluster center is always surrounded by intensive points and of larger density,and the densities of the points around it are less than that of the center.(2)The distance between two cluster centers is farther than that between any one of the two cluster centers and its neighbor points.However,there are some defects in the algorithm:(1)The cutoff distance d_c cannot be automatically selected and has to be selected according to the experience,i.e.,one has to select a cutoff distance by which the number of selected data points should be about 2%of the total number of data points on the data set.(2)The cluster center in DPC should be selected manually.This may cause a large error of selecting cluster centers and affects the final cluster results.(3)The performance of algorithm is affected a lot for large size and high dimension data set.In order to tackle these issues,some improvement schemes have been designed and two improved DPC algorithms were proposed as follows:1.A density peak clustering algorithm that can automatically determine the cutoff distance and cluster centers has been put forward to tackle defects(1)&(2)aforementioned.At first,a method to adaptively computing the cutoff distance d_c has been proposed based on Gini index.The smaller the Gini index,the greater the difference between the denseness and sparseness of the data set.That is to say,when the distribution of each type of data in the data set is increasingly uneven,it will be much more easily for clustering.By using this principle,the cutoff distance can be determined automatically.And then,score function values of local density and offset distance of each point are calculated with a map called the order map.Based on this,potential cluster centers could be determined automatically according to the score variation trend.At last,the final cluster centers can be determined using the method of the cluster center distance being greater than the threshold d_c.The experiments are conducted and the results indicate that the algorithm proposed in this section has a superior performance to the compared ones in both the accuracy of operation and the robustness of data processing.2.An improved grid-based density peak clustering algorithm has been proposed to overcome defect(3).Firstly,a grid partition method and definitions of the adjacent grid and the center point of the grid were given.And then,a more efficient and reasonable method of measuring local density and offset distance was utilized after enlarging the data point to the grid.Meanwhile,the improved algorithm 1 in this paper for automatically determining the cutoff distance and screening the cluster center is adopted to conduct grid-based clustering.By doing so,the space and time complexity of the proposed clustering algorithm have been decreased significantly compared to those of DPC.According to the experiment,the proposed algorithm is identified to have a good performance on both small and large data sets.What's more,the experiments also show a faster operation speed of the proposed algorithm than that of DPC.Besides,the robustness of the proposed algorithm is stronger,especially on large-scale data sets.
Keywords/Search Tags:Density peak, cluster center, variation trend, grid partition
PDF Full Text Request
Related items