Research Of K-means Clustering Algorithm Based On MapReduce

Posted on:2017-08-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Li

Full Text:PDF

GTID:2348330536976758

Subject:Electronic and communication engineering

Abstract/Summary:

Clustering has always been one of the most important branches of data mining. Without prior knowledge, a clustering algorithm can help researchers discover the regular patterns and organizational structure of the objects in a dataset. With the development of technology, the amount of data contained in datasets has grown exponentially, and the traditional model of cluster analysis is no longer sufficient for current data sizes. Recently introduced distributed platforms such as Hadoop and Spark open a new direction for the development and study of cluster analysis, and clustering algorithms have become a research focus. To address the problem that traditional clustering algorithms cannot cluster big data efficiently, this thesis studies and optimizes the clustering algorithm and then introduces a cloud computing scheme. The main work of this thesis is as follows:

(1) First, this thesis analyzes in depth the k-means algorithm, a classic partition-based algorithm. The features and implementation of k-means are introduced, and several of its shortcomings are then discussed. Based on these shortcomings, a solution is proposed that preprocesses the dataset to derive the initial k value and the initial cluster centers for k-means, so that the algorithm is improved by optimizing its initial values. Because partition-based algorithms are sensitive to the shape of the dataset, this thesis also analyzes and improves a density-based algorithm.

(2) To solve the problem that the traditional model of clustering algorithms has difficulty handling large datasets, this thesis studies the MapReduce programming model and gives a parallel design of the improved algorithms in the MapReduce framework on Hadoop.

(3) Through comparative experiments that compare how the two algorithms handle datasets of arbitrary shape, this thesis demonstrates that the k-means algorithm with optimized initial values produces better clustering results than the original k-means algorithm. It also demonstrates that the two parallelized algorithms take full advantage of distributed computing, which greatly reduces computation time and greatly improves data processing efficiency.
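The preprocessing and parallel design summarized in points (1) and (2) can be illustrated with a small sketch. The Python code below is not the thesis's implementation. It assumes the preprocessing step is a canopy-style pass (canopy is listed among the keywords), uses a single hypothetical distance threshold `t2`, and simulates the map and reduce phases in one process instead of running on Hadoop. All function names are illustrative.

```python
import math
from collections import defaultdict


def canopy_centers(points, t2):
    """Simplified canopy pass (assumed preprocessing): pick a point as a
    center, drop every remaining point within distance t2 of it, repeat.
    The number of centers returned serves as the initial k."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans_map(point, centers):
    """Map phase: emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)),
                  key=lambda i: math.dist(point, centers[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce phase: average all points assigned to one center."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in values:
        for d in range(dim):
            total[d] += vec[d]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centers):
    """One k-means iteration: map, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # A center that received no points is kept unchanged.
    return [kmeans_reduce(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]
```

In a real MapReduce job the map function would run on separate input splits and the grouping would be done by the framework's shuffle. The driver would call the iteration repeatedly until the centers stop moving.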

Keywords/Search Tags:

Clustering algorithm, k-means, canopy, parallel computing, MapReduce

Related items

1	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
2	Research On Parallelization Of Clustering Algorithm Based On MapReduce
3	Research On Parallelization Of Clustering Algorithm Based On Mapreduce
4	Research On Distributed Clustering Algorithm Based On MapReduce
5	Parallel Clustering Algorithm Based On MapReduce
6	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
7	Research On The Parallel Clustering Algorithm Based On MapReduce
8	Research On Clustering Collaborative Filtering Recommendation Algorithm Based On MapReduce
9	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
10	The Research On Parallel Computing Technology In Precise Agricultural Climate Division