Research And Application Of Density Peak Clustering Algorithm Based On Spark Framework

Posted on: 2021-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: F Wu
Full Text: PDF
GTID: 2428330614470104
Subject: Computer technology
Abstract/Summary:
In the rapidly developing information era, data has become a focus of attention, and its explosive growth has continuously driven the development of data mining technology. Traditional data mining techniques, however, cannot process massive amounts of data quickly and efficiently, so parallel and distributed data mining techniques offer new research directions for big data analysis. This thesis focuses on density-based clustering and improves the density peak clustering algorithm (CFSFDP, Clustering by fast search and find of density peaks) proposed by Alex Rodriguez et al. in Science in 2014. It addresses two weaknesses of that algorithm: cluster centers are not determined automatically, and the time complexity is high. The main work of this thesis is as follows:

(1) When determining cluster centers, the CFSFDP algorithm relies on manual selection from a decision graph, which introduces subjectivity, so the clustering results lack rigor and accuracy. To address this shortcoming, this thesis proposes AUTO-CFSFDP (Auto determine the cluster-Clustering by fast search and find of density peaks), which determines the cluster centers automatically. First, to handle the uneven distribution of the variables, the density and the distance are normalized. The upper limit of the normalized density threshold is then determined by the Chebyshev inequality, and the upper limit of the normalized distance threshold is determined from the standard deviation. Finally, the decision function determines the upper limit of the decision threshold, so that the two determinants are considered in a unified manner, no center point is missed, and the cluster centers are determined automatically. Experimental results show that the algorithm selects cluster centers adaptively and is robust and effective.

(2) Like CFSFDP, AUTO-CFSFDP must traverse the entire data set during clustering, so it also suffers from high time complexity. To address this disadvantage, this thesis proposes a parallel version of AUTO-CFSFDP, called PAUTO-CFSFDP (Parallel Auto determine the cluster-Clustering by fast search and find of density peaks). The method first partitions the data into data spaces of roughly equal size, then performs local clustering in each data space, and finally merges the local clustering results for global clustering. On this basis, the thesis further improves the assignment of non-center points, using the triangle inequality to simplify the assignment process. Experimental results show that the parallel density peak clustering algorithm under the Spark framework achieves better computational efficiency than the original algorithm.

(3) This thesis applies the parallel density peak clustering algorithm under the Spark framework to medical data of pregnant women, providing recommendations on pregnancy methods for pregnant women and a reference for infant development. Experiments show that the algorithm has practical value.
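
The center-selection idea in item (1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's implementation: the abstract does not give the exact threshold formulas, so the sketch assumes a cut-off kernel for the density, min-max normalization, and a single rule "decision value greater than mean plus k standard deviations" (a rule motivated by the Chebyshev inequality, which bounds the fraction of values that far from the mean by 1/k^2). The names `density_peaks`, `dc`, and `k` are hypothetical.

```python
import math
import statistics


def density_peaks(points, dc, k=2.0):
    """Toy density peak clustering with automatic center selection.

    dc is the cut-off distance; k is an assumed multiplier for the
    mean + k * std threshold on the decision value (not from the thesis).
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho: number of other points closer than dc.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < dc) for i in range(n)]

    # Visit points by decreasing density (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta: distance to the nearest point that comes earlier in that order.
    delta = [0.0] * n
    parent = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(d[i])  # densest point: use its largest distance
            continue
        j = min(order[:pos], key=lambda h: d[i][h])
        delta[i], parent[i] = d[i][j], j

    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    # Decision value gamma = normalized density * normalized distance.
    gamma = [r * t for r, t in zip(normalize(rho), normalize(delta))]

    # Assumed rule: a center is a point whose gamma is unusually large.
    threshold = statistics.mean(gamma) + k * statistics.pstdev(gamma)
    centers = [i for i in range(n) if gamma[i] > threshold]

    # Each remaining point inherits the label of its denser nearest neighbor.
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    for i in order:
        if labels[i] == -1 and parent[i] != -1:
            labels[i] = labels[parent[i]]
    return centers, labels


# Two 3x3 grids of points, centered at (0, 0) and (10, 0).
points = [(cx + dx, dy) for cx in (0, 10) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
centers, labels = density_peaks(points, dc=1.5)
print(centers, labels)
```

On this toy input the two grid midpoints are the only points with both high density and a large distance to any denser point, so they stand out in the decision value without a manual decision graph. The all-pairs distance matrix is also what makes the procedure quadratic in the number of points, which is the cost that the partitioned, Spark-based variant in item (2) is meant to reduce.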
Keywords/Search Tags: clustering, peak density, Chebyshev, distributed computing, data mining