The Research Of Parallel Clustering Algorithm Based On Hadoop Platform

Posted on: 2018-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: J H Liu
GTID: 2348330515476459
Subject: Computer system architecture
Abstract/Summary:
With the development of information technology, commercial databases and the Internet have accumulated data on a large scale. These data contain a wide variety of information, and people are eager to extract what is important from them. How to analyze existing data quickly and uncover its implicit value accurately has become a common problem for many companies and scholars. Clustering occupies a pivotal position among data mining methods and is an effective means of turning unknown objects into known ones. Hadoop can run on a large number of nodes and compute in parallel. MapReduce, the parallel computing model in Hadoop, greatly simplifies the development of distributed parallel programs. The main work and innovations of this thesis are as follows:

(1) To address the low efficiency of the Kmeans algorithm, we design a parallel Kmeans algorithm based on Hadoop and optimize its implementation details to further improve its performance on massive data. The optimization strategy includes: min-max normalization of the input data; adjusting the HDFS data block size; and adding a Combine step between the Map phase and the Reduce phase, which merges the Map output locally to reduce communication between data nodes.

(2) To address the random selection of initial cluster centers in the parallel Kmeans algorithm, this thesis uses the Canopy algorithm to perform a fast coarse clustering that yields a set of initial cluster centers. Because the centers produced by the Canopy algorithm are not accurate, an improved parallel Canopy-Kmeans algorithm is proposed. The main improvements are: estimating the region radius to improve the selection of canopy centers, which reduces the number of iterations of the algorithm; optimizing the Kmeans iteration to lower the overall computational complexity and further speed up each iteration; and removing isolated points from the dataset to obtain more accurate initial cluster centers.

(3) This thesis builds a Hadoop cloud computing platform in a laboratory environment and tests the improved parallel Canopy-Kmeans algorithm, implemented on MapReduce, to verify its performance. The experimental results show that the improved algorithm is effective and convergent, improves clustering accuracy, and reduces the number of iterations. The algorithm also shows good scalability and speedup, which further indicates that the parallel algorithm designed in this thesis is suitable for processing massive datasets.
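The pipeline described in (1) and (2) can be illustrated with a small single-process sketch in Python. It is not the thesis's implementation: it simulates the Map, Combine, and Reduce steps in memory rather than on Hadoop, it uses the basic Canopy pass with a fixed threshold `t2` rather than the improved radius estimation, and all function names (`min_max_normalize`, `canopy_centers`, `map_phase`, `combine_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def min_max_normalize(points):
    """Scale every feature to [0, 1]; a constant feature maps to 0.0."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def canopy_centers(points, t2):
    """Basic Canopy pass (assumed variant): pick a point as a center,
    then discard every remaining point within distance t2 of it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def map_phase(split, centers):
    """Map: emit (index of nearest center, (point, 1)) for each point."""
    for p in split:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, (p, 1)


def combine_phase(pairs):
    """Combine: pre-sum one split's Map output so that only one
    (vector sum, count) pair per cluster has to be shuffled."""
    partial = {}
    for key, (vec, count) in pairs:
        if key in partial:
            acc, n = partial[key]
            partial[key] = (tuple(a + b for a, b in zip(acc, vec)), n + count)
        else:
            partial[key] = (vec, count)
    return list(partial.items())


def reduce_phase(values):
    """Reduce: merge the partial sums of one cluster into its new center."""
    total = sum(count for _, count in values)
    sums = [sum(col) for col in zip(*(vec for vec, _ in values))]
    return tuple(s / total for s in sums)


def kmeans_iteration(splits, centers):
    """One Kmeans iteration over several input splits (simulated data blocks)."""
    shuffled = defaultdict(list)
    for split in splits:
        for key, value in combine_phase(map_phase(split, centers)):
            shuffled[key].append(value)
    new_centers = list(centers)  # a center with no assigned points is kept
    for key, values in shuffled.items():
        new_centers[key] = reduce_phase(values)
    return new_centers


if __name__ == "__main__":
    data = min_max_normalize([(0, 0), (0, 2), (10, 10), (10, 8)])
    initial = canopy_centers(data, t2=0.5)
    print(kmeans_iteration([data[:2], data[2:]], initial))
```

The point of the Combine step in this sketch is that each split forwards at most one partial sum per cluster instead of one record per point, which is the communication saving the abstract refers to. Repeating `kmeans_iteration` until the centers stop moving would give the full iterative algorithm.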
Keywords/Search Tags: clustering, Kmeans, Canopy, Hadoop, MapReduce