Load Balancing Algorithm Based On Data Skew Of MapReduce

Posted on:2019-03-22

Degree:Master

Type:Thesis

Country:China

Candidate:G Z Liu

Full Text:PDF

GTID:2428330545491474

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

MapReduce has been widely used in large scale and high dimension data sets as a kind of distributed programming model,it shows good parallelism and extensibility in mass data processing.Original Hash function is versatile and simple when it is used to divide data in MapReduce,but it often results data skew when data distribution is not uniform.Most of the existing methods of solving data skew are to add a round of sampling operations to determine the frequency of key values and then repartition the data.For example,a parallel clustering algorithm based on MapReduce needs to perform multiple iterations,and the data's distribution of reduce is unclear at each stages of multiple iteration,so this kind of method will add multiple sampling jobs to original algorithm.To solve the imbalance of data partitioning,we proposes a kind of dynamic partition strategy to change the remaining partition index when data is tilted.First.by adding counters in the running process of Map,the data amount or record number allocated by Hash to each reducer is counted,and these messages are sent to Job Trackcr through heartbeat mechanism;Then,Job Trackcr establishes a data skew model based on the global partition information,and obtains load conditions of these reducers to find the reducer that exists data skew;Finally,Job Trackcr calculates the difference of skew reducer's hash value and light load reducer's,which named shift partition value.Then send it to Partitioner to dynamically modify the original partition function during the partition process.Remaining partitions' hash value will plus the shift value,so skew data will be processed by the reducer has lighter load,and each node could achieve load balancing.In addtion,considering heterogeneity of software and hardware in the cluster,we add performance parameters of reducer into the data skew model,so that our method can also achieve good load balancing in heterogeneous environment.We Compared the dynamic partitioning strategy's load balancing performance of this article with the Hash method and the existing dynamic sample method.The validity of this method is verified by running a Word Count program on a dataset that is consistent with the distribution of word frequency;By running the improved k-means++ algorithm on real dataset,we compared MapReduce jobs' execution efficiency after the data partition is balanced by these three methods.Experimental results show our strategy can solve the problem of data skew with better stability and efficiency than Hash method and Sampling method on MapReduce.

Keywords/Search Tags:

MapReduce, data skew, dynamic partition, load balance

PDF Full Text Request

Related items

1	Key Value Based Algorithm For Solving Reduce Load Imbalance In Mapreduce
2	Task Scheduling Optimization Based On Time And Load Balance Under The Hadoop Platform
3	Research On Partition Loading Balance Based On Spark Data Skew
4	Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle
5	Research On Load Balance Of Hadoop Cloud Computing Platform
6	Dynamic Data Partition In Distributed Information Networking Database Management System
7	Load Balancing In Map Reduce Based On Maxdiff Histogram
8	Research On Performance Optimization Of MapReduce Model
9	Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment
10	The Research Of Skew With Sampling Technique In MapReduce