
Research On Optimal Reduce Placement Algorithm Based On Data Skew

Posted on: 2017-06-15    Degree: Master    Type: Thesis
Country: China    Candidate: W Ma    Full Text: PDF
GTID: 2428330488979840    Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of network technology, the Internet has become increasingly popular: businesses are shaped by it and the number of Internet users continues to rise, so the data generated on the Internet grows ever larger and more heterogeneous. Processing, searching, and mining this kind of data has become a new direction for the Internet industry, and it also brings great opportunities for the development of distributed computing and cloud computing. MapReduce is a programming model proposed by Google for distributed parallel processing of large amounts of data. Applications can easily be written on this framework and run on large clusters composed of thousands of machines. It offers automatic parallel execution, high reliability, fault tolerance, ease of programming, and parallel processing of terabyte-scale data sets.

Because of frequent disk I/O and large data transfers across racks and physical nodes, intermediate-data communication has become the biggest performance bottleneck in most running Hadoop systems. This thesis proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes, clusters, or racks to preserve data locality. Since the number of keys cannot be counted until the input data have been processed by the map tasks, the thesis first presents a sampling algorithm based on reservoir sampling to estimate the distribution of keys in the intermediate data. With a random replacement policy over the randomly selected sample zone, the estimated distribution closely approximates the real one. Based on the distribution matrix of intermediate results in each partition, and by computing the distance and cost matrices of cross-node communication, the related map and reduce tasks can be scheduled onto physically close nodes for data locality. CORP is applicable to a wide range of applications and is transparent to
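The reservoir-sampling step described above can be sketched as follows. This is a minimal illustration of classic reservoir sampling (Algorithm R) applied to estimating a key distribution from a stream of map outputs; the function name `reservoir_sample` and the toy key stream are illustrative, not part of the thesis's CORP implementation.

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream
    of unknown length (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each later item replaces a random slot with probability k/(i+1),
            # so every item ends up in the reservoir with equal probability.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Estimate the key distribution from the sample rather than
# counting every intermediate record.
keys = (record % 5 for record in range(100_000))  # toy key stream
sample = reservoir_sample(keys, 1_000)
estimated_distribution = Counter(sample)
```

Because the reservoir holds a uniform sample, the relative key frequencies in `estimated_distribution` approximate the true skew of the intermediate data without a second pass over it.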
the users. We implement CORP in Hadoop 2.4.0 and evaluate its performance with three widely used benchmarks: sort, grep, and join. Experimental results show that CORP not only improves the load balance of reduce tasks effectively but also shortens job execution time by reducing internal data communication. Compared with other reduce scheduling algorithms, the average data transmitted over the core switch across the whole system is decreased substantially.

Beyond task placement on a data-skewed Hadoop cluster, we also use virtual machines to implement task placement for the cloud setting. The closer the VM that starts a task is to its data, the more of that data can be accessed locally. Building on existing models and work, we propose the HM (Heuristic Migration) policy, which further optimizes execution time. Experiments show that execution time decreases and the degree of data locality increases when cross-rack traffic is small.
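The cost-based placement idea, scheduling each reduce partition onto the node that minimizes cross-node transfer cost given the estimated key distribution and the inter-node distance matrix, can be sketched as a simple greedy rule. The function `place_reducers` and both matrices below are hypothetical illustrations, not the thesis's actual CORP scheduler.

```python
import numpy as np

def place_reducers(key_dist, dist_matrix):
    """Greedy placement sketch.

    key_dist[p, n]    -- volume of intermediate data for partition p
                         held on node n (from the sampling estimate)
    dist_matrix[n, m] -- communication cost between nodes n and m

    Returns, for each partition, the node whose total weighted
    transfer cost is minimal.
    """
    placement = []
    for p in range(key_dist.shape[0]):
        # Cost of placing partition p on each candidate node m:
        # sum over source nodes n of volume[n] * distance[n, m].
        costs = key_dist[p] @ dist_matrix
        placement.append(int(np.argmin(costs)))
    return placement

# Two partitions, two nodes: each partition's data sits on one node,
# so the cheapest placement keeps the reducer co-located with its data.
key_dist = np.array([[10.0, 0.0],
                     [0.0, 10.0]])
dist_matrix = np.array([[0.0, 1.0],
                        [1.0, 0.0]])
print(place_reducers(key_dist, dist_matrix))  # → [0, 1]
```

A real scheduler would also account for node capacity and load balance; this sketch only captures the data-locality term of the cost.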
Keywords/Search Tags: Hadoop, data skew, data sampling, intermediate data, MapReduce, reduce placement, cloud computing, VM migration