The Optimization Of Parallelized K-means Based On Mahout

Posted on: 2017-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 2348330503490038
Subject: Systems analysis and integration
Abstract/Summary:
Cluster analysis is an important means of extracting useful information from large amounts of data. K-means is the most classic clustering algorithm and is widely used because it is simple and effective. The rapid development of the Internet industry has sharply increased data volumes, and the traditional k-means algorithm can no longer meet the needs of massive data processing. Research on parallelizing k-means and on optimizing the parallel algorithm is therefore urgently needed. This thesis first explores how to implement k-means in parallel and then proposes an optimization strategy suited to massive data processing. The goal is to reduce the time and space complexity of the algorithm while obtaining better clustering results.
A survey of current research on optimizing and parallelizing k-means shows that existing optimization methods are mainly designed for clustering small amounts of data on a single-node server, while research on parallel k-means focuses on algorithm design. Optimization of parallel k-means therefore remains a weak link. This leads to the research idea of this thesis: optimizing parallel k-means with a lower-complexity algorithm.
As background, this thesis introduces the open-source distributed software framework Hadoop, the MapReduce programming model, and Mahout, a project that produces free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification. It then examines the theory of k-means, its defects, and its parallelized implementation in Mahout. Finally, it proposes an optimization method for parallel k-means: improving it with Canopy.
In the performance testing phase, we used interfaces provided by Mahout, such as the k-means driver, to implement k-means and Canopy k-means, and clustered a Gaussian-distributed data set with both algorithms on Hadoop. Compared with unoptimized k-means, the optimized algorithm performed better: clustering tasks converged more stably to more accurate centroids in fewer iterations, without a significant increase in execution time. Overall, Canopy clearly improved k-means.
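The Canopy-seeded approach described above can be illustrated with a minimal single-machine sketch. It is not the thesis's code and does not use Mahout's API: the function names, the thresholds `t1` and `t2`, and the sample data are all assumptions chosen for illustration. A cheap Canopy pass produces rough centers, which then replace random initial centroids in ordinary k-means; the assignment and averaging steps correspond loosely to the map and reduce phases of a parallel implementation.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def canopy_centers(points, t1, t2):
    """Rough cluster centers from one Canopy pass (illustrative sketch).

    t1 is the loose threshold, t2 the tight one; both are hypothetical
    parameters here and must satisfy t1 > t2 > 0.
    """
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within t1 of the seed belongs to this canopy.
        members = [p for p in remaining if dist(seed, p) < t1]
        centers.append(mean(members))
        # Points within t2 of the seed cannot start another canopy.
        remaining = [p for p in remaining if dist(seed, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=20, tol=1e-6):
    """Plain k-means starting from the given centers.

    Returns the final centers and the number of iterations performed.
    """
    centers = list(centers)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # "Map" step: assign each point to its nearest centroid.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # "Reduce" step: recompute each centroid as the mean of its group.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iterations


if __name__ == "__main__":
    # Synthetic Gaussian data around three assumed true centers.
    rng = random.Random(0)
    true_centers = [(0.0, 0.0), (8.0, 8.0), (0.0, 8.0)]
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in true_centers for _ in range(100)]
    seeds = canopy_centers(data, t1=5.0, t2=3.0)
    final, iters = kmeans(data, seeds)
    print(len(seeds), "canopy centers;", iters, "k-means iterations")
```

In this sketch the number of clusters is not fixed in advance; it falls out of the Canopy thresholds, so the choice of `t1` and `t2` determines how many centroids k-means starts with.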
Keywords/Search Tags: clustering analysis, K-means algorithm, parallelization, Mahout, Canopy