The Research And Implementation Of Comprehensive MapReduce

Posted on:2014-09-08

Degree:Master

Type:Thesis

Country:China

Candidate:S Jiang

Full Text:PDF

GTID:2428330491454027

Subject:Computer Science and Technology

Abstract/Summary:

With the advance of information technology, the volume of data that can be acquired and stored has grown explosively over the last decade. Processing petabytes of data is becoming a daily requirement for many companies and institutions. In this paper, we describe MapReduce, currently the most popular model for massive data processing, and analyze its shortcomings in depth, especially in its dataflow and in the static partitioning of intermediate data. The single fixed dataflow makes the model ill-suited to relational datasets and introduces unnecessary Map phases and DFS I/O. Static partitioning easily causes data skew and an unsuitable number of Reduce instances. To address these problems, we propose Comprehensive MapReduce (CMR). CMR is a generalization of MapReduce: any job that MapReduce can perform can also be handled by CMR. However, CMR significantly improves on MapReduce in its computation flow and in its intermediate data partitioning mechanism. In CMR, Map and Reduce are replaced by Functions. Users connect Functions with dataflows, and a Function can take multiple inputs and produce multiple outputs. Functions and dataflows together form a Directed Acyclic Graph (DAG) that specifies the computation flow. Intermediate data are partitioned dynamically according to their distribution and size, so data skew is restrained and the number of Function instances is determined automatically by the system. CMR also provides a mechanism called the remote combiner. It closely resembles the combiner in MapReduce, but it can process intermediate data across a rack rather than only on the local machine. CMR has significant advantages over Hadoop in efficiency and usability. The paper also presents experiments comparing the efficiency of CMR and Hadoop. The results show that CMR significantly outperforms Hadoop in complicated situations, and that CMR is also more efficient than Hadoop when data skew is severe. However, the effect of CMR's remote combiner was not clearly observed because of the limited test data available.
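
The abstract does not spell out CMR's partitioning algorithm, so the following is only a minimal sketch of the contrast it draws between static and dynamic partitioning. It assumes the per-key record counts of the intermediate data are known before partitioning, and it uses a greedy "largest key to least-loaded partition" heuristic as a stand-in, not the thesis's actual method. The names `static_partition`, `dynamic_partition`, and `target_size` are hypothetical.

```python
import heapq
from collections import Counter


def static_partition(keys, num_reducers):
    """MapReduce-style static partitioning: partition = hash(key) mod R.

    The number of reducers is fixed in advance, regardless of the data.
    Returns the number of records landing on each reducer.
    """
    loads = [0] * num_reducers
    for key, count in Counter(keys).items():
        loads[hash(key) % num_reducers] += count
    return loads


def dynamic_partition(keys, target_size):
    """Illustrative size-aware partitioning (an assumption, not CMR's algorithm).

    The partition count is derived from the data volume, and keys are placed
    largest-first on the currently least-loaded partition. All records of one
    key stay together, as a Reduce-like function requires.
    Returns (key -> partition mapping, per-partition record counts).
    """
    counts = Counter(keys)
    total = sum(counts.values())
    num_parts = max(1, -(-total // target_size))  # ceiling division

    heap = [(0, p) for p in range(num_parts)]  # (current load, partition id)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        load, part = heapq.heappop(heap)
        assignment[key] = part
        heapq.heappush(heap, (load + count, part))

    loads = [0] * num_parts
    for key, part in assignment.items():
        loads[part] += counts[key]
    return assignment, loads


# Skewed intermediate keys: keys 0 and 4 are heavy and collide under "mod 4".
keys = [0] * 6 + [4] * 5 + [1, 2, 3]
print(static_partition(keys, 4))      # [11, 1, 1, 1]
print(dynamic_partition(keys, 5)[1])  # [6, 5, 3]
```

In this toy input the static scheme sends 11 of 14 records to one reducer, while the size-aware scheme chooses three partitions on its own and caps the heaviest at 6 records. A real system would have to estimate key distributions from samples or map-side statistics rather than from a full count.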

Keywords/Search Tags:

massive data processing, MapReduce, cloud computing, dynamic partition

Related items

1	Performance Optimization And Applications Of MapReduce In Cloud Computing
2	Research On Cloud Computing For Massive Data Process And Its Key Technologies
3	Design And Implementation Of Similarity Self - Connection Algorithm For Massive Data Sets Based On MapReduce
4	Research And Implementation Of Local Priority Scheduling Algorithm Based On Mapreduce For Massive Data
5	MapReduce-based Resource Scheduling Model And Algorithm Research In Cloud Environment
6	The Research And Implementation Of Diversity Demand Oriented Parallel Computing Model
7	Research On Pivotal Technologies Of Massive Concurrent Data Processing Based On Cloud Computing
8	Research On Efficient Task Partition And Scheduling In MapReduce Data Processing System
9	The Research Of Parallel Clustering Algorithm Of Massive Data In Cloud Computing Environment
10	Join Method Research Based On MapReduce