Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop

Posted on:2018-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2428330596454785

Subject:Computer Science and Technology

Abstract/Summary:

With the advent of the big data era, explosive data growth challenges the processing and computing power of existing IT architectures. MapReduce emerged as a new computational model in response. Hadoop is an open-source implementation of MapReduce and is widely used by many companies for big data processing. However, it still has shortcomings: the Reduce task scheduling strategy does not take data locality into account, and the platform cannot handle data skew. Both problems hinder wider adoption of the Hadoop platform, and academia and industry have therefore carried out extensive research on it. To address these two problems, this thesis carries out in-depth analysis and research. The specific work is as follows:

(1) Because the existing Hadoop resource management model cannot manage node load and the Reduce task scheduling strategy does not take data locality into account, this thesis proposes a locality model for Reduce tasks based on the network topology of Hadoop and uses naive Bayesian classification to classify node load. On this basis, it proposes a data-locality-aware, multi-level balanced delay scheduling strategy for Reduce tasks, MLBDS (Multi-Level Balanced Delay Scheduler), and embeds the implementation of this scheduling policy into the Capacity Scheduler.

(2) This thesis analyzes the data skew problem on the Hadoop platform and its two kinds of causes, then proposes an incremental multi-queue partition strategy based on sampling. Sampling yields the overall distribution of key values. The key values are divided into multiple smaller partitions, and multiple queues are then used to distribute these smaller partitions evenly across the final partitions, which mitigates data skew.

(3) This thesis builds a Hadoop cluster and implements the proposed MLBDS scheduling strategy and the incremental multi-queue partition strategy. It then compares the MLBDS strategy with the Capacity Scheduler and the Delay Scheduler, and compares the incremental multi-queue partition strategy with the Hash partition strategy. The results verify the correctness and effectiveness of the proposed strategies.
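
The sampling-based partitioning idea in item (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no algorithmic detail, so the function names (`build_partition_map`, `micro_partition`, `reducer_for`), the `micro_per_reducer` parameter, the use of CRC32 hashing, and the greedy "heaviest first to the least-loaded reducer" rule are all assumptions made for illustration. The incremental aspect of the strategy is not modeled here.

```python
import heapq
from collections import Counter
from zlib import crc32


def micro_partition(key, num_micro):
    """Hash a key into one of num_micro small partitions (hypothetical helper)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return crc32(key.encode("utf-8")) % num_micro


def build_partition_map(sampled_keys, num_reducers, micro_per_reducer=8):
    """Return {micro-partition index: reducer index}, balanced on sampled load."""
    num_micro = num_reducers * micro_per_reducer

    # Estimate the load of each micro-partition from the sample.
    load = Counter(micro_partition(k, num_micro) for k in sampled_keys)

    # Min-heap of (assigned_load, reducer): the least-loaded reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Assumed balancing rule: place the heaviest micro-partitions first.
    for micro in sorted(range(num_micro), key=lambda m: -load[m]):
        assigned, reducer = heapq.heappop(heap)
        assignment[micro] = reducer
        heapq.heappush(heap, (assigned + load[micro], reducer))
    return assignment


def reducer_for(key, assignment, num_micro):
    """Look up the reducer for a key at shuffle time."""
    return assignment[micro_partition(key, num_micro)]


if __name__ == "__main__":
    # A skewed sample: one hot key plus many rare keys.
    sample = ["hot"] * 50 + [f"k{i}" for i in range(50)]
    mapping = build_partition_map(sample, num_reducers=4)
    per_reducer = Counter(reducer_for(k, mapping, 32) for k in sample)
    print(sorted(per_reducer.items()))
```

In a real Hadoop job, a lookup like `reducer_for` would sit inside a custom `Partitioner.getPartition` implementation, with the mapping computed from a sampling pass before the job runs. Note that a sketch of this kind cannot split a single hot key across reducers, since every record with that key hashes to the same micro-partition.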

Keywords/Search Tags:

Locality, Load balancing, Delay, Data skew, Increment

Related items

1	A Research Of Load Balancing Algorithms For Data Skew In Spark
2	The Research Of Load Balancing In Mapreduce Based On Data Locality
3	Load Balancing Algorithm Based On Data Skew Of MapReduce
4	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
5	Research Of Data Skew On Spark Based On Improved Partition Method
6	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
7	The Research Of Load Balancing In Mapreduce Based On Sampling Estimation
8	Research On Lightweight Load Balancing Under Mapreduce
9	The Research Of Skew With Sampling Technique In MapReduce
10	Research And Implementation Of Skew Join Optimization Technology On MyCat