
Research on Parallelization of Clustering Algorithms Based on MapReduce

Posted on: 2011-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y A Li
GTID: 2198330338483127
Subject: Software engineering
Abstract/Summary:
Cluster analysis is an important part of data mining and plays an increasingly important role in industry, business and scientific research. However, as the amount of data generated in these areas increases rapidly, a single conventional computer takes a long time to cluster a large-scale dataset. Parallel algorithms can effectively address this problem.

MapReduce, proposed by Google, is a parallel computing model designed mainly for processing massive data. Unlike traditional parallel computing models, MapReduce takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

K-means is a basic partitioning algorithm for cluster analysis, and the sum-of-squared-error criterion is often used as its clustering criterion function. K-means is relatively scalable and efficient. However, when faced with a massive dataset, k-means hits an efficiency bottleneck in calculating the distances between objects, and the computing time grows as the dataset grows. To break this bottleneck, this thesis implements parallel k-means based on MapReduce on the Hadoop platform. To further improve the efficiency of k-means, we use canopy clustering to optimize it, and implement parallel canopy-k-means based on MapReduce on the Hadoop platform. Finally, we compare parallel k-means and parallel canopy-k-means based on MapReduce in terms of the effectiveness of their clustering results, their speedup and their scalability.

The experimental results show that canopy-k-means based on MapReduce achieves higher accuracy and better convergence than k-means based on MapReduce. Both algorithms show good speedup and scalability.
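
The division of k-means into map and reduce steps described above can be illustrated with a minimal single-process sketch. It is not the thesis's Hadoop implementation: the function names (`canopy_centers`, `map_assign`, `reduce_mean`, `kmeans_iteration`) and the threshold `t2` are hypothetical, the canopy step is simplified to a single tight threshold, and using canopy centers to seed k-means is an assumption about one common way the two methods are combined.

```python
from collections import defaultdict
from math import dist


def canopy_centers(points, t2):
    """Simplified canopy pass (assumption: only the tight threshold t2 is used).

    Repeatedly takes the first remaining point as a canopy center and drops
    every remaining point that lies within distance t2 of it.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(center, p) > t2]
    return centers


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-means iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
seeds = canopy_centers(points, t2=3.0)
centroids = kmeans_iteration(points, seeds)
```

In a real MapReduce job the grouping by key is done by the framework's shuffle phase, and the driver reruns the iteration until the centroids stop moving; here a dictionary stands in for the shuffle.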
Keywords/Search Tags: Clustering, Canopy, K-means, Parallel Algorithm, MapReduce, Data Mining