Study On Iterative MapReduce Computation Model For Clustering Analysis

Posted on:2013-12-07

Degree:Master

Type:Thesis

Country:China

Candidate:W Xu

Full Text:PDF

GTID:2268330392970589

Subject:Computer technology

Abstract/Summary:

The MapReduce computation model is an efficient way to process large-scale datasets and is widely used in search engines, e-commerce, and social networking. However, its efficiency on iterative problems is held back by runtime environment re-initialization, static data re-loading, and intermediate network load. In this paper, we classify data as medium-scale or large-scale according to whether it can be cached, in distributed fashion, in the memory of the cluster nodes, and we design two optimization strategies, one for each scale, to improve the efficiency of iterative MapReduce.

The MapCombine solution is designed to process medium-scale data. It adds a cache function to combine tasks to avoid re-loading static data, adds a new Controller component that schedules the iterations to avoid runtime environment re-initialization, and adds an HBase-based interaction layer that persists intermediate data to ensure robustness.

The CycleMap solution is designed to process large-scale data. It adds a Collector component to avoid the performance degradation caused by the sort and shuffle processes. It also introduces a pipeline concept, which indirectly achieves the goal of initializing the runtime environment only once for the whole iterative task.

Finally, we compare the performance of MapCombine and CycleMap against Mahout on three clustering algorithms: K-Means, Fuzzy K-Means, and Dirichlet Process. The average speedup ratios provided by MapCombine and CycleMap are 1.10 and 1.05, respectively.
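The caching idea described for medium-scale data can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the names `CachedKMeansWorker`, `map_combine`, and `run_controller` are hypothetical, plain Python objects stand in for Hadoop tasks, and the HBase persistence layer is omitted. It only shows the general pattern of loading each static partition once and letting a controller loop drive K-Means iterations over the cached data.

```python
from collections import defaultdict


class CachedKMeansWorker:
    """Hypothetical worker that keeps its static partition in memory."""

    def __init__(self, load_partition):
        self._load_partition = load_partition  # callable returning a list of (x, y)
        self._points = None                    # cache, filled on first use
        self.loads = 0                         # counts how often data was loaded

    def map_combine(self, centroids):
        # Load the static data only on the first iteration, then reuse it.
        if self._points is None:
            self._points = self._load_partition()
            self.loads += 1
        # Assign each point to its nearest centroid and pre-aggregate
        # per-cluster sums locally (the "combine" step).
        partial = defaultdict(lambda: [0.0, 0.0, 0])
        for x, y in self._points:
            k = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            acc = partial[k]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
        return dict(partial)


def run_controller(workers, centroids, max_iters=20, tol=1e-9):
    """Hypothetical controller: drives iterations without recreating workers."""
    for _ in range(max_iters):
        totals = defaultdict(lambda: [0.0, 0.0, 0])
        for worker in workers:
            for k, (sx, sy, n) in worker.map_combine(centroids).items():
                total = totals[k]
                total[0] += sx
                total[1] += sy
                total[2] += n
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = [
            (totals[k][0] / totals[k][2], totals[k][1] / totals[k][2])
            if k in totals
            else centroids[k]
            for k in range(len(centroids))
        ]
        shift = max(
            abs(a - c) + abs(b - d)
            for (a, b), (c, d) in zip(new_centroids, centroids)
        )
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

In this sketch each worker's `loads` counter stays at 1 no matter how many iterations run, which is the effect the cache function is meant to have on static data.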

Keywords/Search Tags:

Clustering Algorithms, MapReduce, Iteration, Hadoop

Related items

1	Research, Design And Application Of Clustering Algorithm Using Mapreduce
2	Parallel Clustering Algorithm Based On MapReduce
3	Design Of Mapreduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
4	The Research And Implementation Of Clustering And Convex Hull Algorithms Based On MapReduce Framework
5	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Design And Implementation Of Community Detection Algorithms Based On Mapreduce
7	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
8	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
9	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
10	Research Of Clustering Algorithm Based On Mahout