The Optimization Of Parallelized K-means Based On Mahout

Posted on:2017-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:X Chen

Full Text:PDF

GTID:2348330503490038

Subject:Systems analysis and integration

Abstract/Summary:

Cluster analysis is an important means of extracting useful information from large amounts of data. K-means is the most classic clustering algorithm and is widely used because it is simple and effective. The rapid development of the Internet industry has led to a sharp increase in data volume, and the traditional k-means algorithm can no longer meet the needs of massive data processing. Research on parallelizing k-means, and on optimizing the parallel algorithm, is therefore urgently needed. This thesis first explores how to implement k-means in parallel and then proposes an optimization strategy suited to massive data processing. The goal is to reduce the time and space complexity of the algorithm while obtaining better clustering results.

A study of existing work on optimizing and parallelizing k-means shows that current optimization methods are mainly designed for clustering small amounts of data on a single-node server, while research on parallel k-means focuses on algorithm design. Optimization of parallel k-means therefore remains a weak link. This motivates the approach taken here: optimizing parallel k-means with a lower-complexity algorithm.

As background, the thesis introduces the open-source distributed software framework Hadoop, the MapReduce programming model, and Mahout, a project that produces free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification. It then examines the theory of k-means, its defects, and its parallelized implementation in Mahout. Finally, it proposes an optimization method for parallel k-means: improving parallel k-means with Canopy.

In the performance testing phase, we used the interfaces provided by Mahout, such as the k-means driver, to code the k-means and Canopy k-means algorithms, and clustered a Gaussian-distributed data set with both on Hadoop. Compared with unoptimized k-means, the optimized algorithm performed better: clustering tasks converged more stably to more accurate centroids in fewer iterations, without a significant increase in execution time. Overall, the optimization effect of Canopy on k-means was evident.
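The idea of seeding k-means with Canopy can be illustrated with a minimal single-node sketch in plain Python. This is not the thesis's code and not Mahout's API: the function names `canopy_centers` and `kmeans`, the thresholds `t1` and `t2`, the use of each canopy's centroid as a seed, and the sample data are all assumptions made for illustration. Canopy makes one cheap pass over the data with a loose threshold T1 and a tight threshold T2 (T1 > T2), and the resulting canopy centroids replace random initial centers, which also fixes the value of k.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(members):
    """Component-wise mean of a non-empty list of points."""
    dim = len(members[0])
    return tuple(sum(p[i] for p in members) / len(members) for i in range(dim))


def canopy_centers(points, t1, t2):
    """One-pass Canopy clustering (illustrative; requires t1 > t2 > 0).

    Repeatedly take a remaining point as a canopy center, collect every
    remaining point within t1 as a member, and drop every point within t2
    from the candidate list. Returns one centroid per canopy.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        members = [p for p in remaining if dist(center, p) < t1]
        centers.append(mean(members))
        # The center itself is at distance 0 < t2, so the list always shrinks.
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=100, tol=1e-6):
    """Lloyd's k-means from the given initial centers.

    Returns the final centers and the number of iterations performed.
    """
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster receives no points.
        new_centers = [mean(m) if m else c for c, m in zip(centers, clusters)]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration


if __name__ == "__main__":
    # Hypothetical Gaussian test data: three blobs with standard deviation 0.5.
    rng = random.Random(0)
    blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in blobs for _ in range(100)]

    seeds = canopy_centers(data, t1=6.0, t2=4.0)
    seeded_centers, seeded_iters = kmeans(data, seeds)

    random_centers, random_iters = kmeans(data, rng.sample(data, len(seeds)))
    print("canopy seeds:", len(seeds), "iterations:", seeded_iters)
    print("random seeds:", len(seeds), "iterations:", random_iters)
```

The sketch omits the distributed part entirely. In a MapReduce setting, both the canopy pass and each k-means iteration would be split into map and reduce steps, and the choice of T1 and T2 would depend on the scale of the data.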

Keywords/Search Tags:

clustering analysis, K-means algorithm, parallelization, Mahout, Canopy

Related items

1	Research On Parallelization Of Clustering Algorithm Based On Mapreduce
2	Research On Parallelization Of Clustering Algorithm Based On MapReduce
3	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
4	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel
5	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
6	Research On Hot Topics Discovery In Microblog Based On Distributed K-means Algorithms
7	The Parallelization And Optimization Of K-means Algorithm Based On Spark
8	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout
9	Research Of K-means Clustering Algorithm Based On MapReduce
10	Application Research Of Improved K-means Algorithm In Big Data Clustering