
A Research Of Load Balancing Algorithms For Data Skew In Spark

Posted on: 2019-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: C J Huang
Full Text: PDF
GTID: 2348330563454335
Subject: Software engineering
Abstract/Summary:
In the era of big data, academia and industry widely use large-scale big data platforms to process large volumes of data from different applications and data sources, and data skew is considered one of the important factors that threaten the performance of these platforms. Existing research on data skew is mostly based on the Hadoop platform, and there is little research on in-memory computing platforms such as Spark. Spark jobs also suffer from data skew, caused by the uneven distribution of input data and the unbalanced allocation produced by Spark's default partitioning algorithm. Reducing the total makespan of big data applications on the Spark platform is therefore a major challenge, and even a small improvement to a big data platform may bring a large efficiency gain. This thesis takes the data skew problem on the Spark platform as its research object. It first analyzes why the native Spark platform causes data skew, then studies in detail how to reduce the total makespan of big data applications through balanced data allocation, and proposes two load balancing algorithms. The main research contents are as follows:

Firstly, this thesis summarizes related research on the data skew problem, classifies data skew into multiple types according to the processing stage and the partition function, and summarizes the solutions for each type. It then summarizes the design of the Spark platform, including the computing model and overall architecture, the data storage system, the shuffle read and write mechanism, and the two native data partitioning algorithms, which lays the foundation for implementing the load balancing algorithms.

Secondly, a load balancing algorithm called ReducePartition is proposed to solve the Reduce-oriented data skew problem on the Spark platform. Each compute node samples its local intermediate data according to the sampling algorithm to predict the overall characteristics of the data distribution. To take full advantage of cluster resources, ReducePartition divides the data evenly into multiple partitions. Taking into account the differences in computational power among Executors, each task is assigned to the Executor with the highest performance factor according to a greedy strategy, so as to reduce the total makespan of big data applications.

Next, a load balancing algorithm called MRFair is proposed to solve the Map & Reduce-oriented data skew problem on the Spark platform. MRFair estimates the remaining computation time of each task and reassigns the remaining unprocessed data of the longest-running task to other idle tasks, so as to minimize the impact of data skew as far as possible and reduce the total makespan of big data applications.

Finally, the proposed algorithms are compared with related algorithms on the WordCount and Sort benchmarks on a heterogeneous Spark Standalone cluster, and their performance under different degrees of data skew and different data sizes is analyzed. Multiple sets of tests show that the proposed algorithms can effectively reduce the impact of data skew on the total makespan of Spark big data applications.
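
The greedy assignment idea behind ReducePartition can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not spell out the exact selection rule, so the sketch assumes a common list-scheduling heuristic (largest partition first, each placed on the Executor with the earliest estimated finish time). The function name `greedy_assign`, the `performance_factor` mapping (relative Executor speed) and the example sizes are all hypothetical.

```python
def greedy_assign(partition_sizes, performance_factor):
    """Greedily assign partitions to executors of unequal speed.

    Assumption (not taken from the thesis): partitions are visited from
    largest to smallest, and each one goes to the executor whose estimated
    finish time after receiving it would be the smallest.

    partition_sizes    -- estimated size of each partition (e.g. from sampling)
    performance_factor -- hypothetical relative speed per executor name
    Returns (assignment, makespan), where assignment maps each executor
    name to the list of partition indices it received.
    """
    finish_time = {name: 0.0 for name in performance_factor}
    assignment = {name: [] for name in performance_factor}

    # Largest partitions first, so the big ones are placed while
    # every executor still has plenty of slack.
    order = sorted(range(len(partition_sizes)),
                   key=lambda i: partition_sizes[i],
                   reverse=True)

    for i in order:
        size = partition_sizes[i]
        # Estimated finish time if this partition were added to executor e.
        best = min(performance_factor,
                   key=lambda e: finish_time[e] + size / performance_factor[e])
        finish_time[best] += size / performance_factor[best]
        assignment[best].append(i)

    # The makespan is the finish time of the slowest-finishing executor.
    return assignment, max(finish_time.values())


if __name__ == "__main__":
    sizes = [8, 4, 4, 2]                     # skewed partition sizes
    factors = {"fast": 2.0, "slow": 1.0}     # heterogeneous executors
    plan, makespan = greedy_assign(sizes, factors)
    print(plan, makespan)
```

In this toy setting the fast Executor receives the largest partition and both Executors finish at the same estimated time, which is the balancing effect the abstract describes. A real scheduler would also have to account for data locality and sampling error, which the sketch ignores.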
Keywords/Search Tags: Spark platform, data skew, load balancing, total makespan