Study On Iterative Mapreduce Computation Model For Clustering Analysis

Posted on: 2013-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
GTID: 2268330392970589
Subject: Computer technology
Abstract/Summary:
The MapReduce computation model is an efficient way to process large-scale datasets. It is widely used in search engines, e-commerce, and social networking. However, its efficiency on iterative problems is held back by runtime environment re-initialization, static data re-loading, and intermediate network load. In this paper, we classify data as medium-scale or large-scale according to whether it can be cached, spread across the memory of the nodes in the distributed environment, and we design two optimization strategies, one for each scale, to improve the efficiency of iterative MapReduce.

The MapCombine solution is designed to process medium-scale data. It adds a cache function to the combine tasks to avoid re-loading static data; adds a new component, the Controller, which schedules the iterations to avoid runtime environment re-initialization; and adds an interaction layer based on HBase that persists intermediate data to ensure robustness.

The CycleMap solution is designed to process large-scale data. It adds a Collector component to avoid the performance degradation caused by the sort and shuffle processes. The solution also introduces a pipeline concept, which indirectly achieves the goal of initializing the runtime environment only once for the whole iterative task.

Finally, we compare the performance of MapCombine and CycleMap against Mahout on three clustering algorithms: K-Means, Fuzzy K-Means, and Dirichlet Process. The average speedup ratios provided by MapCombine and CycleMap are 1.10 and 1.05, respectively.
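
The idea of keeping static data cached while a controller drives the iterations can be illustrated with a small single-process K-Means sketch. This is not the MapCombine implementation from the thesis: the function names (`map_combine`, `reduce_centroids`, `controller`), the in-memory lists standing in for cached partitions, and the convergence tolerance are all assumptions made for illustration. Only the centroids change between iterations; the input points are loaded once and reused.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_combine(partition, centroids):
    """Assign each cached point to its nearest centroid and combine locally.

    Returns {cluster index: (coordinate sums, point count)} for one partition,
    so only small partial results would need to leave the node.
    """
    partial = {}
    for point in partition:
        idx = nearest(point, centroids)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_centroids(partials, old_centroids):
    """Merge the partial results of all partitions into new centroids."""
    new_centroids = []
    for idx, old in enumerate(old_centroids):
        sums, count = [0.0] * len(old), 0
        for partial in partials:
            if idx in partial:
                part_sums, part_count = partial[idx]
                sums = [a + b for a, b in zip(sums, part_sums)]
                count += part_count
        # An empty cluster keeps its previous centroid.
        new_centroids.append(tuple(x / count for x in sums) if count else old)
    return new_centroids


def controller(cached_partitions, centroids, max_iter=20, tol=1e-9):
    """Hypothetical controller: iterate over partitions that stay in memory.

    Stops when no centroid moves by more than tol (squared distance)
    or after max_iter iterations. Returns (centroids, iterations run).
    """
    for iteration in range(1, max_iter + 1):
        partials = [map_combine(p, centroids) for p in cached_partitions]
        updated = reduce_centroids(partials, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(u, c))
            for u, c in zip(updated, centroids)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids, iteration
```

In a plain MapReduce job chain, each iteration would be submitted as a separate job that re-reads the points from storage; in the sketch, `cached_partitions` is built once and passed to every iteration, which is the cost the cache function is meant to remove.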
Keywords/Search Tags: Clustering Algorithms, MapReduce, Iteration, Hadoop