
Research Of Natural Neighbor Based Density Clustering Algorithm And Its Parallelization

Posted on: 2019-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2428330566977997
Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis is a data processing technique based on the similarity between data objects. It is widely applied in many fields, such as e-commerce and network security. As research on the technique has deepened, more and more algorithms have appeared, and cluster analysis has developed considerably in recent decades. However, many open problems remain: how to handle high-dimensional data sets, how to distinguish clusters of various shapes, how to deal with noise points, how to handle data sets whose clusters differ greatly in density, how to determine the number of clusters effectively, and even how to evaluate the quality of a clustering result.

Clustering analysis has many branches. In particular, density-based clustering algorithms define core points, boundary points and density reachability in order to cluster a data set. This approach not only handles clusters of different shapes well, but also identifies the noise points of a data set accurately without the number of clusters being specified in advance, and it is highly interpretable. Because of these advantages, many scholars have devoted themselves to this kind of algorithm in recent years. However, deeper study shows that it also has several disadvantages. Take DBSCAN, one of the most classic algorithms, as an example. First, the algorithm depends heavily on its input parameters, and the choice of parameters has a significant effect on the clustering result. Second, boundary points are assigned to clusters according to the order in which core points are visited, which is hard to justify. Finally, it cannot handle data sets whose clusters differ greatly in density.

In this paper, a new density-based clustering algorithm (NN-DBSCAN) built on the natural neighbor algorithm is proposed. The method first processes the data set with the natural neighbor algorithm to obtain partial prior information, which is then used to extract the core points of the data set and to calculate a neighborhood radius for each data point. As a result, the method requires no input parameters, and it also handles clusters with large density differences well. The new algorithm also modifies DBSCAN's definition of direct density reachability so that boundary points are classified more effectively. After analyzing the time complexity of the new algorithm and the parallel frameworks in wide use, we propose a new parallel framework, based on data and process, to parallelize the natural neighbor algorithm. The experimental results show that NN-DBSCAN outperforms DBSCAN on many data sets, and that the new parallel framework, which speeds up the natural neighbor algorithm, works more efficiently than Spark.
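The abstract does not spell out the natural neighbor step, so the following is only a minimal sketch of one commonly described formulation of natural neighbor search, not the thesis's own code. It assumes the search radius r grows one neighbor at a time until either every point has at least one reverse neighbor or the number of points without any stops changing. The function name `natural_neighbor_search` and the brute-force distance sort are illustrative choices, and how NN-DBSCAN turns this output into core points and per-point radii is not shown because the abstract does not describe it.

```python
from math import dist


def natural_neighbor_search(points):
    """Parameter-free neighbor search (illustrative sketch, assumed formulation).

    Returns (r, natural), where r is the number of rounds at termination and
    natural[i] is the set of points that are mutual neighbors of point i
    within those r rounds.
    """
    n = len(points)
    # Brute-force neighbor ordering: for each point, all other indices
    # sorted by Euclidean distance (O(n^2 log n), fine for a small demo).
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: dist(points[i], points[j]))
        for i in range(n)
    ]
    knn = [set() for _ in range(n)]  # neighbors found so far for each point
    rnn = [set() for _ in range(n)]  # reverse neighbors: who chose this point
    prev_orphans = None
    r = 0
    while r < n - 1:
        # Round r + 1: every point adds its next-nearest neighbor.
        for i in range(n):
            j = order[i][r]
            knn[i].add(j)
            rnn[j].add(i)
        r += 1
        # Points that nobody has chosen as a neighbor yet.
        orphans = sum(1 for s in rnn if not s)
        # Stop when no orphans remain or the orphan count has stabilised.
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    natural = [knn[i] & rnn[i] for i in range(n)]
    return r, natural


if __name__ == "__main__":
    # Four points on a unit square plus one distant point.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    r, natural = natural_neighbor_search(pts)
    print("rounds:", r)
    for i, nb in enumerate(natural):
        print(i, sorted(nb))
```

In this toy data set the distant point never becomes anyone's neighbor, so it ends with no natural neighbors. That is the kind of prior information, obtained without any user-supplied parameter, that a density-based method could use to separate sparse points from dense ones.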
Keywords/Search Tags: clustering, natural neighbor, density, core point, DBSCAN