Distributed EM Clustering Algorithm Based On Hadoop Platform

Posted on: 2015-01-23 | Degree: Master | Type: Thesis
Country: China | Candidate: J G Su | Full Text: PDF
GTID: 2268330428980091 | Subject: Computer application technology
Abstract/Summary:
With the advent of Big Data, researchers at home and abroad are placing increasing emphasis on how to obtain valuable knowledge from large data, how to discover meaningful patterns and rules in massive data with intelligent algorithms, and how to extract information that guides decision-making from the ocean of data with effective tools. As an important branch of data mining, clustering analysis divides data objects into several clusters so that objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible. However, with the rapid growth of data scale, stand-alone serial clustering algorithms have run into several bottlenecks, such as the inability to load the required data into memory in one pass, poor execution efficiency, and the lack of parallel processing. The emergence and development of Hadoop distributed computing technology provides an effective way to solve these problems.

The Hadoop distributed platform stores massive data in HDFS (Hadoop Distributed File System) and processes large-scale data sets in parallel through the MapReduce programming framework. By combining the characteristics of serial clustering algorithms with the MapReduce (MR) programming framework, researchers and users can readily implement parallel algorithms and improve their execution efficiency without needing a deep understanding of the underlying details of the Hadoop platform. Most importantly, this helps people acquire valuable information and knowledge from large data.

In clustering analysis, setting reasonable initialization parameters is a key factor for the EM clustering algorithm based on the Gaussian Mixture Model. The selection and setting of initialization parameters affect not only the complexity and the number of iterations of the algorithm but also the final clustering results. Therefore, a well-defined mechanism for selecting initialization parameters can effectively reduce the number of iterations and improve the accuracy of the clustering results. Based on an analysis of traditional initialization methods such as random initialization, k-means initialization and hierarchical clustering initialization, this thesis proposes a novel density-based initialization method, MergeC. Exploiting the fact that each cluster of samples has high density at its center and low density at its edge, the method extracts the best candidate centers from every cluster, merges them, and obtains the parameters for the Gaussian Mixture Model. The experimental results showed that the proposed algorithm performed better.

The traditional EM method has to load the data into memory many times and lacks effective parallelism and execution efficiency. To address these problems, this thesis combines the serial EM algorithm with the MapReduce framework, presents a distributed EM clustering algorithm based on the Hadoop platform, and implements distributed parallel processing schemes for the EM algorithm. The algorithm achieves distributed processing of massive data with a reasonable amount of redundant computation, calculating the mean and the covariance matrix in two phases, MeanMapReduce and VarMapReduce. Finally, the performance of the proposed algorithm was validated on the Hadoop platform with data sets of different sizes. The results showed that the execution speed of the algorithm improved as the number of data nodes increased. The method implements parallel clustering analysis and mining and, in particular, improves the efficiency of the EM algorithm when processing massive data.
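
The abstract does not give the internals of the two phases, so the following is only a minimal sketch of how an M-step could be split into a mean pass and a covariance pass in the map/reduce style. It is written in plain Python rather than against the Hadoop API. The function names (mean_map, mean_reduce, var_map, var_reduce), the in-memory "splits", and the hard 0/1 responsibilities are illustrative assumptions, not the thesis's actual MeanMapReduce and VarMapReduce code.

```python
# Sketch only: simulates two map/reduce phases in memory (assumed design).


def mean_map(points, resp):
    """Phase 1 map: per-cluster partial weight and weighted coordinate sum for one split."""
    partial = {}
    for x, r in zip(points, resp):
        for k, r_k in enumerate(r):
            w, s = partial.get(k, (0.0, [0.0] * len(x)))
            partial[k] = (w + r_k, [a + r_k * b for a, b in zip(s, x)])
    return partial


def mean_reduce(partials):
    """Phase 1 reduce: merge partial sums, return {cluster: (weight, mean)}."""
    total = {}
    for partial in partials:
        for k, (w, s) in partial.items():
            tw, ts = total.get(k, (0.0, [0.0] * len(s)))
            total[k] = (tw + w, [a + b for a, b in zip(ts, s)])
    # Assumes every cluster has non-zero total weight.
    return {k: (w, [v / w for v in s]) for k, (w, s) in total.items()}


def var_map(points, resp, means):
    """Phase 2 map: per-cluster partial weighted scatter matrix, using the phase 1 means."""
    partial = {}
    for x, r in zip(points, resp):
        for k, r_k in enumerate(r):
            d = [a - m for a, m in zip(x, means[k])]
            n = len(d)
            scatter = partial.setdefault(k, [[0.0] * n for _ in range(n)])
            for i in range(n):
                for j in range(n):
                    scatter[i][j] += r_k * d[i] * d[j]
    return partial


def var_reduce(partials, weights):
    """Phase 2 reduce: merge scatter matrices and divide by each cluster's weight."""
    total = {}
    for partial in partials:
        for k, m in partial.items():
            if k not in total:
                total[k] = [row[:] for row in m]
            else:
                for i, row in enumerate(m):
                    for j, v in enumerate(row):
                        total[k][i][j] += v
    return {k: [[v / weights[k] for v in row] for row in m] for k, m in total.items()}


# Two hypothetical data splits: (points, responsibilities of each point for clusters 0 and 1).
splits = [
    ([[0.0, 0.0], [2.0, 2.0]], [[1.0, 0.0], [1.0, 0.0]]),
    ([[10.0, 10.0], [12.0, 14.0]], [[0.0, 1.0], [0.0, 1.0]]),
]

stats = mean_reduce(mean_map(p, r) for p, r in splits)
means = {k: m for k, (w, m) in stats.items()}
weights = {k: w for k, (w, m) in stats.items()}
covs = var_reduce((var_map(p, r, means) for p, r in splits), weights)

print(means)
print(covs)
```

In this sketch the data is read twice, once per phase, because the covariance pass needs the means produced by the first pass. That is one possible reading of the "redundant computation" mentioned above. In an actual Hadoop job each map function would presumably run over one HDFS block and each reduce function would merge the partial results by cluster key.
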
Keywords/Search Tags: EM Clustering Algorithm, Hadoop Platform, MapReduce Framework, Gaussian Mixture Model