The Research Of Parallel Clustering Algorithm Based On Hadoop Platform

Posted on:2018-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:J H Liu

Full Text:PDF

GTID:2348330515476459

Subject:Computer system architecture

Abstract/Summary:

With the development of information technology, commercial databases and the Internet have accumulated data on a massive scale. This data contains a wide variety of information, and people are eager to extract important knowledge from it. How to analyze existing data quickly and uncover its implicit value accurately has become a common problem for many companies and researchers. Clustering occupies a pivotal position among data mining methods and is an effective means of turning unknown objects into known ones. Hadoop can run on a large number of nodes to compute in parallel, and MapReduce, Hadoop's parallel computing model, greatly simplifies the development of distributed parallel programs. The main work and contributions of this thesis are as follows:

(1) To address the low efficiency of the Kmeans algorithm, we design a parallel Kmeans algorithm based on Hadoop and optimize its implementation details to further improve its performance on massive data. The optimizations are: min-max normalization of the input data; adjusting the HDFS data block size; and adding a Combine step between the Map phase and the Reduce phase, which merges the Map output to reduce communication between data nodes.

(2) To address the randomness of the initial cluster centers in the parallel Kmeans algorithm, this thesis uses the Canopy algorithm to cluster quickly and obtain a set of initial cluster centers. Because the centers produced by the Canopy algorithm are not accurate, an improved parallel Canopy-Kmeans algorithm is proposed. The main improvements are: estimating the region radius to improve the selection of canopy centers, thereby reducing the number of iterations; optimizing the Kmeans iterative process to reduce the overall computational complexity and further increase the iteration speed; and removing isolated points from the dataset to obtain more accurate initial cluster centers.

(3) This thesis builds a Hadoop cloud computing platform in a laboratory environment and tests the improved MapReduce-based parallel Canopy-Kmeans algorithm to verify its performance. The experimental results show that the improved algorithm is effective and convergent, and that it further improves clustering accuracy and reduces the number of iterations. The algorithm also shows good scalability and speedup, which further supports the conclusion that the parallel algorithm designed in this thesis is suitable for massive datasets.
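
The pipeline outlined in the abstract (min-max normalization, Canopy seeding, then Kmeans iterations with a Combine step) can be sketched in a few lines of plain Python. This is an illustrative in-process simulation, not the thesis's Hadoop implementation: every function name, the `t2` distance threshold, and the way input splits are represented as lists are assumptions made for the example. The seeding function is the classic Canopy pass; the thesis's radius estimation and isolated-point removal are not reproduced here.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def min_max_normalize(points):
    """Rescale every feature to [0, 1]; a constant feature maps to 0.0."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def canopy_centers(points, t2):
    """Classic Canopy seeding: take a point as a center, then drop every
    remaining point within distance t2 of it (t2 is a hypothetical threshold)."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def kmeans_map(split, centers):
    """Map: emit (index of nearest center, (point, 1)) for each point."""
    for p in split:
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield idx, (p, 1)


def combine(pairs):
    """Combine and Reduce share this logic: sum vectors and counts per key."""
    acc = {}
    for idx, (vec, n) in pairs:
        if idx in acc:
            total, count = acc[idx]
            acc[idx] = (tuple(a + b for a, b in zip(total, vec)), count + n)
        else:
            acc[idx] = (vec, n)
    return acc.items()


def kmeans_iteration(splits, centers):
    """One Kmeans iteration over a list of input splits."""
    shuffled = []
    for split in splits:
        # Local combine: each split sends at most one partial sum per cluster,
        # instead of one record per point.
        shuffled.extend(combine(kmeans_map(split, centers)))
    totals = dict(combine(shuffled))  # Reduce: merge the partial sums
    # A center that attracted no points is kept unchanged.
    return [
        tuple(x / totals[i][1] for x in totals[i][0]) if i in totals else c
        for i, c in enumerate(centers)
    ]


if __name__ == "__main__":
    data = min_max_normalize([(0, 0), (0, 2), (10, 10), (10, 8)])
    seeds = canopy_centers(data, t2=0.5)
    print(kmeans_iteration([data[:2], data[2:]], seeds))
```

Because partial sums and counts are associative, the same function can serve as both the combiner and the reducer; that property is what lets the Combine step cut shuffle traffic without changing the resulting centers.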

Keywords/Search Tags:

clustering, Kmeans, Canopy, Hadoop, MapReduce

Related items

1	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop
2	Reach On Map-reduce Application Based On Hadoop
3	Reach On Map-Reduce Application Based On Hadoop
4	Research And Optimization On K-medoids Clustering Algorithm Based On Hadoop Platform
5	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Research And Implementation Of A Hybird Recommendation System Based On Auto Encoder And Canopy-Kmeans Algorithm
7	Research On The Application Of User Behavior Analysis Based On Hadoop
8	Research On MapReduce Model For Fusion Architecture And Accelerated Strategy For Hadoop
9	Research And Implementation Of Internet Public Opinion Analysis System Based On Hadoop
10	Application Of Improved Clustering Algorithm Based On Hadoop In Web Log Clustering