Research And Application Of Density Peak Clustering Algorithm Based On Spark Framework

Posted on:2021-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:F Wu

GTID:2428330614470104

Subject:Computer technology

Abstract/Summary:

In the rapidly developing information era, data has gradually become a focus of attention. Its explosive growth has continuously driven the development of data mining technology. However, traditional data mining techniques cannot process massive amounts of data quickly and efficiently, so parallel and distributed data mining techniques offer a new research direction for big data analysis. This thesis focuses on density-based clustering and improves the density peak clustering algorithm (CFSFDP, clustering by fast search and find of density peaks) proposed by Alex Rodriguez in Science in 2014. The improvements address two shortcomings of that algorithm: cluster centers are not determined automatically, and the time complexity is high. The main work of this thesis is as follows:

(1) When determining cluster centers, CFSFDP requires manual selection from a decision graph, which introduces subjectivity, so the clustering results lack rigor and accuracy. To address this shortcoming, this thesis proposes AUTO-CFSFDP (Auto determine the cluster-Clustering by fast search and find of density peaks), which determines cluster centers automatically. First, to handle unevenly distributed variables, the density and distance are normalized. The upper limit of the normalized density threshold is then determined by the Chebyshev inequality, and the upper limit of the normalized distance threshold is determined from the standard deviation. Finally, a decision function determines the upper limit of the decision threshold; it considers both factors jointly, avoids missing center points, and determines the cluster centers automatically. Experimental results show that the algorithm selects cluster centers adaptively and is robust and effective.

(2) Like CFSFDP, AUTO-CFSFDP must traverse the entire data set during clustering, so it also suffers from high time complexity. To address this, this thesis proposes a parallel version of the algorithm, PAUTO-CFSFDP (Parallel Auto determine the cluster-Clustering by fast search and find of density peaks). The method first partitions the data into data spaces of roughly equal size, then performs local clustering within each data space, and finally merges the local clustering results for global clustering. On this basis, the assignment of non-center points is further improved by using the triangle inequality to simplify the assignment process. Experimental results show that the parallel density peak clustering algorithm on the Spark framework achieves better computational efficiency than the original algorithm.

(3) The parallel density peak clustering algorithm on the Spark framework is applied to medical data of pregnant women, providing recommendations on pregnancy methods for pregnant women and references for infant development. Experiments show that the algorithm has practical value.
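
The automatic center selection described in (1) can be illustrated with a minimal Python sketch. The density and distance definitions follow the original CFSFDP algorithm. The thesis's exact thresholds are not given in this abstract, so the rule used here is an assumption: a point is flagged when its normalized decision value exceeds the mean by `k` standard deviations, a cutoff motivated by the Chebyshev inequality, which bounds the share of values at least `k` standard deviations from the mean by 1/k². The function names, the cutoff distance `d_c`, the index-based tie-breaking, and the toy data are likewise hypothetical.

```python
import math
from statistics import mean, pstdev


def density_peaks(points, d_c):
    """Return (rho, delta) per point, following the CFSFDP definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density rho: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to points of higher density. Ties in density are broken
        # by index (an assumption of this sketch) so that equally dense
        # points are not all treated as the global peak.
        higher = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        # The densest point has no denser neighbour, so it receives the
        # largest distance to any other point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def auto_centers(rho, delta, k=2.0):
    """Flag points whose decision value gamma = rho * delta is unusually large.

    Assumed rule (not taken from the thesis): gamma > mean + k * std.
    """
    gamma = [r * d for r, d in zip(min_max(rho), min_max(delta))]
    threshold = mean(gamma) + k * pstdev(gamma)
    return [i for i, g in enumerate(gamma) if g > threshold]


# Toy data: two 3x3 grids of points, centred on (0, 0) and (5, 5).
offsets = (-0.1, 0.0, 0.1)
cluster_a = [(dx, dy) for dx in offsets for dy in offsets]
cluster_b = [(5 + dx, 5 + dy) for dx in offsets for dy in offsets]
points = cluster_a + cluster_b

rho, delta = density_peaks(points, d_c=0.12)
centers = auto_centers(rho, delta)
print(centers)
```

On this toy data the sketch flags indices 4 and 13, the middle points of the two grids, without any manual inspection of a decision graph. The pairwise distance matrix makes the quadratic cost per data set visible, which is the cost that the partitioned, parallel variant in (2) is meant to reduce.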

Keywords/Search Tags:

clustering, peak density, chebyshev, distributed computing, data mining

Related items

1	Research And Application Of Clustering By Fast Search And Find Of Density Peaks
2	Density-based Clustering Algorithm On Streaming Data
3	Research On The Grid Density Peak Clustering Algorithm
4	Multi-Granular Big Data Analytics Based On Density Peak
5	Research Of Clustering Algorithm Based On Density Peak
6	Research On Hierarchical Clustering Algorithm Based On Density Peaks
7	Research And Application Of Fast Density Peak Clustering Algorithm
8	Research On Density Peak Clustering And Its Application In Community Detection
9	Research And Implementation Of Density Peaks Clustering Algorithm
10	Research And Application Of Financial Big Data Based On Density Peak Clustering Of K Near Neighbors