
Research On The Parallel Clustering Algorithm Based On MapReduce

Posted on: 2017-01-03  Degree: Master  Type: Thesis
Country: China  Candidate: Z M Ding  Full Text: PDF
GTID: 2348330485452690  Subject: Computer technology
Abstract/Summary:
As an unsupervised data processing method, clustering analysis is widely used in data mining, image processing, biology, astronomy and other fields. However, with the rapid development of Internet technology, the volume of data of all kinds has grown sharply. A traditional single computer can no longer meet the requirements of clustering analysis on data at this scale. The MapReduce parallel programming model proposed by Google allows traditional clustering algorithms to run across several computers, which reduces the computational load on each machine and shortens the clustering time. In this thesis, we study the fuzzy c-means (FCM) algorithm further. The main work is as follows.

First, to address the high time complexity of the FCM algorithm, we propose a parallel FCM algorithm based on MapReduce. In the Map stage, the algorithm computes in parallel the degree of membership of each data point to each cluster center. In the Reduce stage, it computes in parallel the updated cluster centers. We also add a Combine step between Map and Reduce that merges the Map output locally, which reduces the communication between data nodes and the amount of work passed to the Reduce stage.

Second, to address the sensitivity of the FCM algorithm to the initial cluster centers, we exploit the ability of the Canopy algorithm to produce a fast, coarse clustering of the data set. We propose a fuzzy c-means clustering algorithm based on the Canopy algorithm (Canopy-FCM), and design and implement it in MapReduce.

Third, to address the blindness and imprecision of the Canopy algorithm, we exploit the better clustering quality obtained by the minimum and maximum distance algorithm. We propose a fuzzy c-means clustering algorithm based on the minimum and maximum distance algorithm (MM-FCM), and design and implement the improved MM-FCM algorithm in MapReduce.

Finally, experiments show that the parallel FCM algorithm on the Hadoop platform runs efficiently on large-scale data sets, and that the improved Canopy-FCM and MM-FCM algorithms achieve better clustering quality and efficiency than the FCM algorithm.
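
The Map, Combine and Reduce roles described for the parallel FCM algorithm can be illustrated with a small single-process sketch. It is not the thesis implementation: it assumes the standard FCM update rules (membership u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)), center v_j = sum_i u_ij^m x_i / sum_i u_ij^m), a fuzzifier m = 2, and Euclidean distance. All function names (memberships, fcm_map, fcm_combine, fcm_reduce, fcm_iteration) are hypothetical, and plain Python lists stand in for Hadoop input splits.

```python
import math


def memberships(x, centers, m=2.0):
    """Standard FCM membership of point x in each cluster (assumed rule)."""
    # Clamp distances so a point sitting exactly on a center does not divide by zero.
    d = [max(math.dist(x, c), 1e-12) for c in centers]
    inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
    total = sum(inv)
    return [v / total for v in inv]


def fcm_map(split, centers, m=2.0):
    """Map: for each point in one split, emit (cluster, (u^m * x, u^m))."""
    for x in split:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            yield j, ([w * xi for xi in x], w)


def fcm_combine(records):
    """Combine: sum the partial numerators and denominators per cluster key."""
    acc = {}
    for j, (wx, w) in records:
        if j in acc:
            old_wx, old_w = acc[j]
            acc[j] = ([a + b for a, b in zip(old_wx, wx)], old_w + w)
        else:
            acc[j] = (list(wx), w)
    return list(acc.items())


def fcm_reduce(records, n_clusters):
    """Reduce: merge the combined sums from all splits and form new centers."""
    merged = dict(fcm_combine(records))
    return [[v / merged[j][1] for v in merged[j][0]] for j in range(n_clusters)]


def fcm_iteration(splits, centers, m=2.0):
    """One FCM iteration: map and combine per split, then a single reduce."""
    combined = []
    for split in splits:
        combined.extend(fcm_combine(fcm_map(split, centers, m)))
    return fcm_reduce(combined, len(centers))
```

Because the Combine step only adds up weighted sums, each split hands one record per cluster to the Reduce step instead of one record per point and cluster, which is the communication saving the abstract refers to. Calling fcm_iteration repeatedly, feeding the returned centers back in, gives the usual iterative FCM loop; a convergence test on the change in centers is left out of this sketch.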
Keywords/Search Tags: clustering analysis, parallel, MapReduce, FCM algorithm, Canopy algorithm, minimum and maximum distance algorithm