The Research and Implementation of Comprehensive MapReduce

Posted on: 2014-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: S Jiang
GTID: 2428330491454027
Subject: Computer Science and Technology
Abstract/Summary:
With the advance of information technology, the volume of data that can be acquired and stored has grown explosively over the last decade. Processing petabytes of data is becoming a daily requirement for many companies and institutions. In this paper, we describe MapReduce, currently the most popular model for massive data processing, and analyze its shortcomings in depth, especially its dataflow and its static partitioning of intermediate data. The single, fixed dataflow makes the model ill-suited to relational datasets and produces unnecessary Map phases and DFS I/O. Static partitioning easily leads to data skew and to an unsuitable number of Reduce instances. To address these problems, we propose Comprehensive MapReduce (CMR). CMR is a generalization of MapReduce: any job that MapReduce can perform can also be performed by CMR. However, CMR significantly improves on MapReduce in its computation flow and its intermediate-data partitioning mechanism. In CMR, Map and Reduce are replaced by Functions, which users connect with dataflows. A Function can take multiple inputs and produce multiple outputs. Functions and dataflows together compose a Directed Acyclic Graph (DAG) that specifies the computation flow. Intermediate data are partitioned dynamically according to their distribution and size, so data skew is restrained and the number of Function instances is determined automatically by the system. CMR also provides a mechanism called the remote combiner. It is very similar to the combiner in MapReduce, but it can process intermediate data across a rack rather than only on the local machine. CMR has significant advantages over Hadoop in efficiency and usability. This paper also presents experiments comparing the efficiency of CMR and Hadoop. The results show that CMR significantly outperforms Hadoop in complicated scenarios and is also more efficient than Hadoop when data skew is severe. However, the effect of CMR's remote combiner could not be clearly observed because of the limited test data available to us.
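The contrast between static and dynamic partitioning of intermediate data can be illustrated with a minimal sketch. This is not CMR's actual algorithm, which the abstract does not specify. It assumes a simple greedy heuristic: the number of partitions is derived from the total data size and a hypothetical `target_size` parameter, and each key group is assigned, largest first, to the currently least-loaded partition. All function names here are illustrative.

```python
import heapq
import zlib
from collections import Counter


def static_partition(keys, num_partitions):
    """Static partitioning: the partition depends only on a hash of the key,
    so a few very frequent keys can overload a single partition."""
    loads = [0] * num_partitions
    for key, count in Counter(keys).items():
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        loads[index] += count
    return loads


def dynamic_partition(keys, target_size):
    """Illustrative dynamic partitioning (an assumption, not CMR's method).

    The partition count is chosen from the observed data size, and key
    groups are placed greedily on the least-loaded partition.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    # Ceiling division: enough partitions to hold roughly target_size records each.
    num_partitions = max(1, -(-total // target_size))

    # Min-heap of (current load, partition index).
    heap = [(0, i) for i in range(num_partitions)]
    assignment = {}
    # Largest key groups first; ties broken by key for determinism.
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        assignment[key] = index
        heapq.heappush(heap, (load + count, index))

    loads = [0] * num_partitions
    for key, index in assignment.items():
        loads[index] += counts[key]
    return assignment, loads


if __name__ == "__main__":
    skewed_keys = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
    print("static loads: ", static_partition(skewed_keys, 2))
    print("dynamic loads:", dynamic_partition(skewed_keys, target_size=6)[1])
```

In this sketch all records of one key stay in the same partition, so a single extremely frequent key still bounds the achievable balance. How CMR handles that case is not stated in the abstract.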
Keywords/Search Tags: massive data processing, MapReduce, cloud computing, dynamic partition