Research On The Parallel Clustering Algorithm Based On MapReduce

Posted on:2017-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:Z M Ding

Full Text:PDF

GTID:2348330485452690

Subject:Computer technology

Abstract/Summary:

As an unsupervised data processing method, clustering analysis is widely used in data mining, image processing, biology, astronomy, and other fields. However, with the rapid development of Internet technology, the volume of data of all kinds has increased sharply. Faced with data at this scale, a traditional single computer can no longer meet the requirements of large-scale clustering analysis. The MapReduce parallel programming model proposed by Google allows traditional clustering algorithms to run across several computers, which can reduce the computational burden on each machine and shorten the clustering time. This thesis studies the fuzzy c-means (FCM) algorithm further; the main work is as follows.

First, to address the high time complexity of the FCM algorithm, we propose a parallel FCM algorithm based on MapReduce. The Map stage computes, in parallel, the degree of membership of each data point in each cluster center, and the Reduce stage computes, in parallel, the updated cluster centers. A Combine step is added between Map and Reduce to merge the Map output, which reduces the communication between data nodes and the number of Reduce operations.

Second, because the FCM algorithm is sensitive to the initial cluster centers, we exploit the ability of the Canopy algorithm to perform a fast, coarse clustering of the data set and propose a fuzzy c-means clustering algorithm based on the Canopy algorithm (Canopy-FCM), which we design and implement in MapReduce.

Third, because the Canopy algorithm is blind and imprecise, we exploit the better clustering quality obtainable with the minimum and maximum distance algorithm and propose a fuzzy c-means clustering algorithm based on the minimum and maximum distance algorithm (MM-FCM). We then design and implement the improved MM-FCM algorithm in MapReduce.

Finally, experiments show that the parallel FCM algorithm on the Hadoop platform runs efficiently on large-scale data sets, and that the improved Canopy-FCM and MM-FCM algorithms achieve better clustering quality and efficiency than the FCM algorithm.
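The Map, Combine, and Reduce stages of the parallel FCM algorithm can be illustrated with a minimal single-process sketch. This is not the thesis implementation, which targets Hadoop; it only assumes the standard FCM update rules (membership from relative distances with fuzzifier m, centers as membership-weighted means). All function names (`memberships`, `fcm_map`, `fcm_combine`, `fcm_reduce`, `fcm_iteration`) and the in-memory "splits" that stand in for data nodes are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in each cluster (assumed formula)."""
    d = [dist(point, c) for c in centers]
    if 0.0 in d:  # point coincides with a center: crisp membership
        hit = d.index(0.0)
        return [1.0 if j == hit else 0.0 for j in range(len(centers))]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[j] / d[k]) ** exp for k in range(len(centers)))
            for j in range(len(centers))]

def fcm_map(point, centers, m=2.0):
    """Map: for one point, emit (cluster index, (u^m * point, u^m))."""
    for j, u in enumerate(memberships(point, centers, m)):
        w = u ** m
        yield j, ([w * x for x in point], w)

def fcm_combine(pairs):
    """Combine: sum the (weighted sum, weight) pairs per cluster index."""
    acc = {}
    for j, (vec, w) in pairs:
        if j in acc:
            old_vec, old_w = acc[j]
            acc[j] = ([a + b for a, b in zip(old_vec, vec)], old_w + w)
        else:
            acc[j] = (list(vec), w)
    return acc

def fcm_reduce(partials, centers):
    """Reduce: merge per-split partial sums and compute the new centers."""
    merged = fcm_combine((j, val) for part in partials for j, val in part.items())
    new_centers = []
    for j, old in enumerate(centers):
        vec, w = merged.get(j, (None, 0.0))
        # Keep the old center if no weight reached this cluster.
        new_centers.append([v / w for v in vec] if w > 0 else list(old))
    return new_centers

def fcm_iteration(splits, centers, m=2.0):
    """One FCM iteration; each split plays the role of one data node."""
    partials = [fcm_combine(kv for p in split for kv in fcm_map(p, centers, m))
                for split in splits]
    return fcm_reduce(partials, centers)

# Usage: two splits of a toy 2-D data set, two initial centers.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]
new_centers = fcm_iteration([points[:2], points[2:]], centers)
```

Because the combiner forwards only one (weighted sum, weight) pair per cluster from each split, the amount of data passed to the Reduce stage depends on the number of clusters rather than the number of points, which is the communication saving the abstract attributes to the Combine step. In a full run, `fcm_iteration` would be repeated until the centers stop changing by more than a chosen tolerance.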

Keywords/Search Tags:

clustering analysis, parallel, MapReduce, FCM algorithm, Canopy algorithm, minimum and maximum distance algorithm

Related items

1	Research And Application Of K-means Clustering Algorithm Based On Distributed Computing Platform
2	Research On Text Clustering And Its Application In Topic Detection Analysis
3	Chase, Based Decoding Algorithm
4	Research On Distributed Clustering Algorithm Based On MapReduce
5	Research On Parallel Clustering Algorithm Based On Map-Reduce
6	Research On Collaborative Filtering Recommendation Algorithm Based On Clustering
7	The Research Of The Maximum Flow And The Minimum Cost Algorithm
8	Research On Parallelization Of Clustering Algorithm Based On MapReduce
9	Research On Parallelization Of Clustering Algorithm Based On Mapreduce
10	The Limitations Of Collaborative Filtering Algorithm And Its Improvement