Research On Parallelization Of Clustering Algorithm Based On MapReduce

Posted on:2011-08-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y A Li

Full Text:PDF

GTID:2198330338483127

Subject:Software engineering

Abstract/Summary:

Cluster analysis is an important part of data mining and plays an increasingly important role in industry, business, and scientific research. However, as the amount of data generated in these areas increases rapidly, a conventional computer takes a long time to cluster large-scale datasets. Parallel algorithms can effectively address this problem.

MapReduce, proposed by Google, is a parallel computing model aimed mainly at massive data processing. Unlike traditional parallel computing models, MapReduce takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

K-means is a basic partitioning algorithm in cluster analysis; the sum-of-squared-error criterion is often used as its clustering criterion function. K-means is relatively scalable and efficient. However, on massive datasets, k-means hits an efficiency bottleneck in computing the distances between objects, and the computation time grows with the size of the dataset. To break this bottleneck, this thesis implements parallel k-means based on MapReduce on the Hadoop platform. To further improve the efficiency of k-means, we use canopy clustering to optimize k-means and implement parallel canopy-k-means based on MapReduce on the Hadoop platform.

Finally, we compare parallel k-means and canopy-k-means based on MapReduce in terms of clustering effectiveness, speedup, and scalability. The experimental results show that canopy-k-means based on MapReduce achieves higher accuracy and better convergence than k-means based on MapReduce. Both have good speedup and scalability.
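As a rough illustration of the ideas in the abstract, the following single-process Python sketch shows how one k-means iteration can be split into a map step (assign each point to its nearest center) and a reduce step (average the points of each center), with a simplified canopy pass used to pick the initial centers. It is not the thesis's Hadoop implementation: all function names, the in-memory "shuffle", and the use of only the tight canopy threshold `t2` are assumptions made for this example.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def canopy_centers(points, t2):
    """Simplified canopy seeding (assumption: only the tight threshold t2 is used).

    A point becomes a new center if it lies farther than t2 from every
    center chosen so far.
    """
    centers = []
    for p in points:
        if all(sq_dist(p, c) > t2 * t2 for c in centers):
            centers.append(p)
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    return idx, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one center."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """One k-means iteration; the dict stands in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # A center with no assigned points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


def sse(points, centers):
    """Sum-of-squared-error criterion: squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    seeds = canopy_centers(data, t2=3.0)
    centers = kmeans_iteration(data, seeds)
    print(centers, sse(data, centers))
```

In a real MapReduce job the map and reduce functions would run on different machines and the current centers would be distributed to every mapper before each iteration; the sketch keeps everything in one process only to make the data flow visible.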

Keywords/Search Tags:

Clustering, Canopy, K-means, Parallel Algorithm, MapReduce, Data Mining

Related items

1	Research On Parallelization Of Clustering Algorithm Based On Mapreduce
2	Research Of K-means Clustering Algorithm Based On MapReduce
3	Research On Distributed Clustering Algorithm Based On MapReduce
4	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
5	Research And Application Of Clustering Mining Algorithm Oriented Big Data Based On MapReduce
6	Parallel Clustering Algorithm Based On MapReduce
7	Research On The Parallel Clustering Algorithm Based On MapReduce
8	Accelerating Clustering Algorithm On The Cuda Graphics Processor
9	Research On Clustering Collaborative Filtering Recommendation Algorithm Based On MapReduce
10	Research On Distributed Parallel Data Mining Algorithm Based On Weblog