Font Size: a A A

Load Balancing Algorithm Based On Data Skew Of MapReduce

Posted on:2019-03-22Degree:MasterType:Thesis
Country:ChinaCandidate:G Z LiuFull Text:PDF
GTID:2428330545491474Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
MapReduce has been widely used in large scale and high dimension data sets as a kind of distributed programming model,it shows good parallelism and extensibility in mass data processing.Original Hash function is versatile and simple when it is used to divide data in MapReduce,but it often results data skew when data distribution is not uniform.Most of the existing methods of solving data skew are to add a round of sampling operations to determine the frequency of key values and then repartition the data.For example,a parallel clustering algorithm based on MapReduce needs to perform multiple iterations,and the data's distribution of reduce is unclear at each stages of multiple iteration,so this kind of method will add multiple sampling jobs to original algorithm.To solve the imbalance of data partitioning,we proposes a kind of dynamic partition strategy to change the remaining partition index when data is tilted.First.by adding counters in the running process of Map,the data amount or record number allocated by Hash to each reducer is counted,and these messages are sent to Job Trackcr through heartbeat mechanism;Then,Job Trackcr establishes a data skew model based on the global partition information,and obtains load conditions of these reducers to find the reducer that exists data skew;Finally,Job Trackcr calculates the difference of skew reducer's hash value and light load reducer's,which named shift partition value.Then send it to Partitioner to dynamically modify the original partition function during the partition process.Remaining partitions' hash value will plus the shift value,so skew data will be processed by the reducer has lighter load,and each node could achieve load balancing.In addtion,considering heterogeneity of software and hardware in the cluster,we add performance parameters of reducer into the data skew model,so that our method can also achieve good load balancing in heterogeneous environment.We Compared the dynamic partitioning strategy's load balancing performance of this article with the Hash method and the existing dynamic sample method.The validity of this method is verified by running a Word Count program on a dataset that is consistent with the distribution of word frequency;By running the improved k-means++ algorithm on real dataset,we compared MapReduce jobs' execution efficiency after the data partition is balanced by these three methods.Experimental results show our strategy can solve the problem of data skew with better stability and efficiency than Hash method and Sampling method on MapReduce.
Keywords/Search Tags:MapReduce, data skew, dynamic partition, load balance
PDF Full Text Request
Related items