
Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop

Posted on: 2018-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2428330596454785
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, explosive data growth challenges the processing and computing power of existing IT architectures. MapReduce emerged as a computational model in response, and Hadoop, an open-source implementation of MapReduce, is widely used by companies for big data processing. Hadoop still has shortcomings, however: its Reduce task scheduling strategy does not take data locality into account, and it cannot handle data skew. Both problems hinder wider adoption of the platform, and both academia and industry have carried out extensive research on Hadoop. This thesis analyzes these two problems in depth. The specific work is as follows:

(1) Because the existing Hadoop resource management model cannot manage node load and the Reduce task scheduling strategy ignores data locality, this thesis proposes a locality model for Reduce tasks based on Hadoop's network topology and uses naive Bayesian classification to classify node load. On this basis, it proposes a data-locality-aware, multi-level balanced delay scheduling strategy for Reduce tasks, MLBDS (Multi-Level Balanced Delay Scheduler), and embeds its implementation in the Capacity Scheduler.

(2) This thesis analyzes the data skew problem on Hadoop and its two kinds of causes, then proposes an incremental multi-queue partition strategy based on sampling. Sampling yields the overall distribution of the keys. The keys are first divided into many smaller partitions, and multiple queues are then used to distribute these smaller partitions evenly across the final partitions, which addresses data skew.

(3) This thesis builds a Hadoop cluster and implements the proposed MLBDS scheduling strategy and the incremental multi-queue partition strategy. It then compares MLBDS with the Capacity Scheduler and the Delay Scheduler, and compares the incremental multi-queue partition strategy with the Hash partition strategy. The results verify the correctness and effectiveness of the proposed strategies.
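The sampling-based partition idea in (2) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: the function names, the use of CRC32 hashing to form the smaller partitions, and the greedy "largest partition to the least-loaded reducer" rule standing in for the multi-queue step. The incremental aspect of the strategy is not modeled.

```python
import heapq
from collections import Counter
from zlib import crc32


def sample_key_counts(keys, step):
    """Estimate key frequencies from every `step`-th record (systematic sampling)."""
    return Counter(keys[::step])


def assign_micro_partitions(key_counts, num_micro, num_reducers):
    """Hash keys into `num_micro` small partitions, then place each small
    partition on the currently least-loaded reducer, largest partition first.
    This greedy rule is an assumed stand-in for the multi-queue step."""
    micro_load = [0] * num_micro
    for key, count in key_counts.items():
        micro_load[crc32(key.encode()) % num_micro] += count

    # Min-heap of (load, reducer_id): the least-loaded reducer is popped first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    micro_to_reducer = {}
    for micro in sorted(range(num_micro), key=lambda m: -micro_load[m]):
        load, reducer = heapq.heappop(heap)
        micro_to_reducer[micro] = reducer
        heapq.heappush(heap, (load + micro_load[micro], reducer))
    return micro_to_reducer, micro_load


def reducer_loads(micro_to_reducer, micro_load, num_reducers):
    """Sum the estimated load that each reducer receives."""
    loads = [0] * num_reducers
    for micro, reducer in micro_to_reducer.items():
        loads[reducer] += micro_load[micro]
    return loads


if __name__ == "__main__":
    # Skewed toy input: key "a" dominates.
    keys = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 10
    counts = sample_key_counts(keys, 1)
    mapping, micro_load = assign_micro_partitions(counts, 8, 2)
    print(reducer_loads(mapping, micro_load, 2))
```

Under this greedy rule, no reducer's estimated load exceeds the average load plus the size of the largest small partition. A single very hot key still lands in one small partition, so this sketch balances skew caused by uneven key grouping better than skew caused by one dominant key.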
Keywords/Search Tags: Locality, Load balancing, Delay, Data skew, Increment