
Research Of K-means Clustering Algorithm Based On MapReduce

Posted on: 2017-08-05  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Li  Full Text: PDF
GTID: 2348330536976758  Subject: Electronic and communication engineering
Abstract/Summary:
Clustering has long been one of the most important branches of data mining. Without prior knowledge, a clustering algorithm can help researchers discover regular patterns and organizational structure in the objects of a dataset. As technology develops, the amount of data contained in such datasets grows exponentially, and the traditional model of cluster analysis is no longer sufficient for current data sizes. Recently introduced distributed platforms such as Hadoop and Spark provide a new direction for the development and study of cluster analysis, and clustering algorithms have become a research focus. To address the problem that traditional clustering algorithms cannot cluster big data efficiently, this thesis studies and optimizes the clustering algorithm and then introduces a cloud computing scheme. The main work of this thesis is as follows:

(1) First, this thesis analyzes in depth the k-means algorithm, a classic partition-based algorithm. Its features and implementation are introduced, and several of its shortcomings are then elaborated. Based on these shortcomings, a solution is proposed that preprocesses the dataset to derive the initial value of k and the initial cluster centers for k-means, improving the algorithm from the perspective of optimizing its initial values. Because partition-based algorithms are sensitive to the shape of the dataset, this thesis also analyzes and improves a density-based algorithm.

(2) To solve the problem that the traditional model of clustering algorithms has difficulty handling large datasets, this thesis studies the MapReduce programming model and gives a parallel design of the improved algorithms in the MapReduce framework on Hadoop.

(3) Through comparative experiments that examine how the two algorithms behave on datasets of arbitrary shape, this thesis shows that k-means with optimized initial values gives better clustering results than the original k-means algorithm. It also shows that the two parallelized algorithms fully reflect the advantages of distributed computing, which greatly reduces computation time and greatly improves data processing efficiency.
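The abstract does not give the parallel design itself, so the following is only a minimal single-process sketch of how one k-means iteration is commonly split into map, combine, and reduce steps. The function names (`map_phase`, `combine`, `reduce_phase`, `kmeans`), the use of squared Euclidean distance, and the handling of empty clusters are assumptions made for illustration, not the thesis's implementation. The initial centers are simply passed in; in the thesis they would come from the preprocessing step, which is not reproduced here.

```python
def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def map_phase(points, centers):
    """Map: for each point in one input split, emit (cluster index, (point, 1))."""
    for point in points:
        yield nearest(point, centers), (point, 1)

def combine(pairs):
    """Combine: sum vectors and counts per cluster index to shrink shuffle traffic."""
    sums = {}
    for idx, (vec, n) in pairs:
        if idx in sums:
            acc, cnt = sums[idx]
            sums[idx] = ([a + v for a, v in zip(acc, vec)], cnt + n)
        else:
            sums[idx] = (list(vec), n)
    return sums

def reduce_phase(partials, centers):
    """Reduce: merge the partial sums of all splits and compute the new centers."""
    merged = combine((idx, val) for part in partials for idx, val in part.items())
    new_centers = []
    for i, old in enumerate(centers):
        if i in merged:
            acc, cnt = merged[i]
            new_centers.append(tuple(a / cnt for a in acc))
        else:
            # Assumption: a cluster that received no points keeps its old center.
            new_centers.append(tuple(old))
    return new_centers

def kmeans(splits, centers, max_iter=20, tol=1e-9):
    """Driver: repeat map/combine/reduce until the centers stop moving."""
    for _ in range(max_iter):
        partials = [combine(map_phase(split, centers)) for split in splits]
        new_centers = reduce_phase(partials, centers)
        shift = max(sum((a - b) ** 2 for a, b in zip(old, new))
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers

# Two hypothetical input splits, standing in for blocks processed by separate mappers.
splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
result = kmeans(splits, [(0.0, 0.0), (10.0, 10.0)])
```

In a real Hadoop job each split would be handled by a separate mapper and the driver loop would launch one job per iteration; here the splits are just lists so the data flow can be followed in one process.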
Keywords/Search Tags: Clustering algorithm, k-means, canopy, parallel computing, MapReduce