The Research Of Optimized Density Peaks Clustering And Its Distributed Algorithms

Posted on: 2016-09-11 | Degree: Master | Type: Thesis
Country: China | Candidate: B Han | Full Text: PDF
GTID: 2428330542957358 | Subject: Computer application technology
Abstract/Summary:
Clustering is a common data analysis technique used in many applications, such as astronomy, biology, social networks, and pattern recognition. Clustering algorithms can be classified into many categories, including center-based, density-based, and model-based clustering. These algorithms generally have drawbacks, such as sensitivity to the initial input parameters, inefficiency on large data sets, and restrictions on the shape of the clusters. The density peaks clustering algorithm (DP) was recently published in the journal Science. It characterizes each object by two key features (rho and delta) for clustering. Compared with traditional clustering algorithms, its advantages include being interactive and iterative and being independent of the data distribution. However, DP is sensitive to the value of the input parameter dc, and an inappropriate value degrades the clustering quality. To solve this problem, we propose KNN-DP, a modified DP algorithm based on the density of the K nearest neighbors, which effectively alleviates the sensitivity to the choice of dc. In addition, to cluster large high-dimensional data sets, we design a simple distributed version, the Naive KNN-DP distributed algorithm, which uses a blocking strategy on the MapReduce model. To further improve efficiency, we propose a new distributed KNN-DP algorithm based on LSH (Locality Sensitive Hashing). With LSH, nearby data points are hashed to the same bucket with higher probability. The algorithm thereby avoids a large number of unnecessary distance calculations and intermediate results, which greatly improves the efficiency of the distributed computation and reduces the runtime. To evaluate KNN-DP, we perform a series of experiments. The results show that KNN-DP improves clustering quality and at the same time reduces runtime compared with the original DP algorithm. We also evaluate the efficiency of the two proposed distributed algorithms, Naive KNN-DP and LSH KNN-DP. Our results show that the LSH KNN-DP distributed algorithm requires much less runtime than the Naive KNN-DP distributed algorithm, at the cost of a small loss in quality.
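The ideas summarized in the abstract can be illustrated with a small single-machine sketch. This is not the thesis's implementation, and the abstract does not give the exact formulas, so several choices here are assumptions: rho is taken as the inverse of the average distance to the k nearest neighbors (suggested by the keywords), cluster centers are picked by the largest rho * delta product, and the LSH family is a random-projection hash with a hypothetical bucket `width`. All function names (`knn_density`, `density_peaks`, `make_lsh_hash`) are made up for this example, and `k` is assumed to be smaller than the number of points.

```python
import math
import random


def knn_density(points, k):
    """rho_i = 1 / (mean distance to the k nearest neighbours). Assumed form."""
    rho = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn = sum(dists[:k]) / k
        rho.append(1.0 / (mean_knn + 1e-12))  # guard against duplicate points
    return rho


def density_peaks(points, k, n_clusters):
    """Brute-force O(n^2) density peaks clustering with a kNN-based density."""
    n = len(points)
    rho = knn_density(points, k)
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            # Densest point: delta is its largest distance to any point.
            delta[i] = max(math.dist(points[i], q) for q in points)
            continue
        # delta_i = distance to the nearest point that is denser than i.
        delta[i], parent[i] = min(
            (math.dist(points[i], points[j]), j) for j in order[:pos]
        )
    rank = {i: pos for pos, i in enumerate(order)}
    # Assumption: centers are the points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], rank[i]))
    labels = [-1] * n
    for cluster_id, c in enumerate(centers[:n_clusters]):
        labels[c] = cluster_id
    # Every other point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta


def make_lsh_hash(dim, width, rng):
    """One random-projection hash h(x) = floor((a.x + b) / width).

    Nearby points tend to share a bucket, so distances would only need
    to be computed within a bucket. The LSH family is an assumption.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(point):
        projection = sum(ai * xi for ai, xi in zip(a, point))
        return math.floor((projection + b) / width)

    return h
```

For example, `density_peaks(points, k=2, n_clusters=2)` returns one label per point together with the rho and delta values. In a distributed setting, the bucket key from `make_lsh_hash` could serve as the MapReduce key so that each reducer only compares points within one bucket. That saves distance computations but can miss true neighbors that fall into different buckets, which is consistent with the small quality loss the abstract reports.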
Keywords/Search Tags: density peaks clustering, k nearest neighbor average distance, MapReduce, LSH