
Research On Distributed Clustering Algorithm Based On Cloud Computing

Posted on: 2019-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Lu
Full Text: PDF
GTID: 2428330542995102
Subject: Software engineering
Abstract/Summary:
People generate a wide variety of data in the course of production and daily life. These data often contain rich information, and mining it can bring substantial improvements to human life and work. This demand gave rise to data mining. Cluster analysis is an important data analysis method in data mining: it divides the data in a database into clusters (classes) so that data within the same cluster are highly similar while data in different clusters are dissimilar. Cluster analysis is now widely used in social network analysis, statistical data analysis, business intelligence, and other fields. Advances in Internet, database, and hardware storage technologies have made it possible to acquire and store large amounts of data, and how to use data mining methods to rapidly analyze high-dimensional, large-scale data and extract information from it has become a hot topic.

Against this background, this thesis uses the cloud computing framework MapReduce to address the high computational complexity of the current density peak clustering algorithm, and studies a distributed density peak clustering algorithm based on z-values (DP-z). The algorithm uses a Z-order space-filling curve to map a high-dimensional dataset onto a one-dimensional space and partitions the dataset into groups according to the z-values of the data points. To obtain the correct clustering result, data are then exchanged between groups, after which the computation proceeds in parallel. DP-z applies a filtering strategy to the data exchanged between groups, which avoids many unnecessary distance calculations, reduces data transmission overhead, and improves execution efficiency. Theoretical analysis shows that DP-z is more efficient than the original density peak clustering algorithm while producing the same clustering results. The thesis designs and implements DP-z on the open-source Hadoop cloud computing platform and validates the effectiveness of the approach through comparative experiments.

In addition, the density calculation of the original density peak clustering algorithm is insensitive to distance, which can cause errors in the computed densities. To address this, the thesis studies an improved density peak clustering algorithm that refines the density measure of the original algorithm. The number of data points within a given range of each data point serves as the base value, and the distribution of the data points within that range is used as additional information, so that the density of each data point is measured more accurately and clustering accuracy improves. Experiments verify that the improved density peak clustering algorithm achieves better clustering results than the original algorithm.
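
As a rough illustration of the z-value mapping that DP-z builds on, the sketch below interleaves the bits of quantized coordinates to obtain a one-dimensional Z-order (Morton) code, then cuts the z-sorted points into contiguous groups. It is not the thesis's implementation: the function names (`z_value`, `quantize`, `group_by_z`), the 8-bit grid resolution, the `[lo, hi]` value range, and the equal-size grouping rule are all assumptions made for this example, and the filtering strategy for data exchange between groups is not shown because the abstract does not specify it.

```python
def z_value(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates
    into a single Z-order (Morton) code."""
    z = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            # Bit `bit` of dimension `d` goes to position bit * dims + d.
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z


def quantize(point, lo, hi, bits=8):
    """Map real-valued coordinates in [lo, hi] onto the integer grid
    [0, 2**bits - 1] (assumed preprocessing step)."""
    cells = (1 << bits) - 1
    return [min(cells, max(0, int((x - lo) / (hi - lo) * cells))) for x in point]


def group_by_z(points, num_groups, lo=0.0, hi=1.0, bits=8):
    """Sort points by z-value and split the sorted order into
    contiguous, roughly equal-sized groups (assumed grouping rule)."""
    ordered = sorted(points, key=lambda p: z_value(quantize(p, lo, hi, bits), bits))
    size = -(-len(ordered) // num_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    sample = [(0.9, 0.9), (0.1, 0.1), (0.1, 0.2), (0.8, 0.9)]
    for group in group_by_z(sample, num_groups=2):
        print(group)
```

Because points that are close on the Z-order curve tend to be close in the original space, each group could in principle be handed to a separate MapReduce task. Points near a group boundary may still have close neighbors in another group, which is why the abstract describes an additional data exchange step between groups.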
Keywords/Search Tags:Density peaks, Clustering analysis, Distributed computing, Z-order curve, Big data