
Research on Partition Load Balancing Based on Spark Data Skew

Posted on: 2021-04-02    Degree: Master    Type: Thesis
Country: China    Candidate: Z C Huang    Full Text: PDF
GTID: 2428330602986115    Subject: Electronic and communication engineering
Abstract/Summary:
Data volumes are expanding rapidly in the era of big data. Hadoop and Spark are the big data analysis platforms usually used to process these large amounts of real-time data, and they offer a divide-and-conquer solution. A key issue for this approach to real-time data processing, however, is data skew, which seriously degrades the performance of both platforms. Most existing work addresses data skew on traditional data platforms, and research on data skew in Spark is comparatively scarce. In Spark, the default partitioning algorithm causes data skew after the Shuffle operation when the data is unevenly distributed. Existing remedies distribute the overloaded tasks to additionally split or merged partitions, but these extra operations in turn hurt system performance. This thesis therefore studies the data skew problem in Spark, focusing on how to reduce the total completion time of an application through partition load balancing. A load balancing mechanism based on an improved reservoir sampling algorithm and one based on linear regression partition prediction are proposed. The main work of this thesis is as follows.

(1) To solve Reduce-side data skew under the Spark computing framework, SP-IRS (Spark load balancing mechanism based on an Improved Reservoir Sampling algorithm) is proposed. Compared with existing mechanisms, this algorithm adds a variable weight to the traditional reservoir sampling algorithm in order to predict partition sizes. To make full use of cluster resources, a data skew detection model classifies the data into skewed and non-skewed data, and the skewed data is distributed evenly across the partitions according to the matrix generated by the prediction. The mechanism thus makes the load more balanced.

(2) To further reduce the total completion time of an application, SP-LRP (Spark load balancing mechanism based on Linear Regression Partition prediction) is proposed. This mechanism uses a linear regression prediction algorithm to build a Reduce partition prediction model. Compared with existing mechanisms, it requires no additional sampling operation, so the overall job completion time is reduced. The framework works as follows. First, the partition tracker uses the heartbeat mechanism to collect job runtime information. Second, the runtime statistics are sent to the partition size predictor, which builds a prediction model based on the linear regression algorithm. After the size of each partition is predicted, the data skew detection model identifies the skewed partitions. Finally, the resource allocator derives resource requirements from the identified normal partition sizes.
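The abstract names a weighted variant of reservoir sampling for SP-IRS but does not reproduce its code. The following is a minimal Scala sketch of one standard way to add weights to reservoir sampling, the A-Res scheme of Efraimidis and Spirtakis; using the record's serialized size as the weight is an assumption made here for illustration, not the thesis's definition.

    import scala.util.Random
    import scala.collection.mutable.PriorityQueue

    // Weighted reservoir sampling (A-Res): keep the k records with the
    // largest u^(1/w) keys, where u ~ U(0,1) and w is the record's weight.
    // A larger weight (e.g. record size in bytes) raises the chance of
    // being sampled, so heavy keys are better represented when the sample
    // is used to estimate partition sizes.
    object WeightedReservoir {
      private final case class Entry[T](item: T, key: Double)

      def sample[T](records: Iterator[(T, Double)], k: Int,
                    rng: Random = new Random()): Vector[T] = {
        // Min-heap on key: the head is the weakest of the k kept so far.
        val heap = PriorityQueue.empty[Entry[T]](
          Ordering.by[Entry[T], Double](_.key).reverse)
        for ((item, weight) <- records if weight > 0.0) {
          val key = math.pow(rng.nextDouble(), 1.0 / weight)
          if (heap.size < k) heap.enqueue(Entry(item, key))
          else if (key > heap.head.key) {
            heap.dequeue(); heap.enqueue(Entry(item, key))
          }
        }
        heap.toVector.map(_.item)
      }
    }

Counting the sampled records per reduce key and scaling by the sampling fraction yields per-key load estimates, which is the kind of input the prediction matrix described above would need.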
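The skew detection and redistribution step is likewise only described in prose. A plausible reading, sketched below, classifies a key as skewed when its predicted load exceeds a multiple of the average, then spreads skewed keys round-robin over all partitions while leaving the rest on default hash routing; the threshold factor and the map of predictions are illustrative assumptions, not the thesis's definitions.

    // Hypothetical skew-aware routing sketch. Keys whose predicted load
    // exceeds factor * average are classified as skewed and spread evenly
    // (round-robin) across partitions; non-skewed keys keep hash routing.
    class SkewAwareRouter(numPartitions: Int,
                          predicted: Map[String, Long],
                          factor: Double = 2.0) {
      private val avg =
        if (predicted.isEmpty) 0.0
        else predicted.values.sum.toDouble / predicted.size
      private val skewedKeys: Set[String] =
        predicted.collect { case (k, n) if n > factor * avg => k }.toSet
      private val next = new java.util.concurrent.atomic.AtomicInteger(0)

      def partition(key: String): Int =
        if (skewedKeys.contains(key))
          math.floorMod(next.getAndIncrement(), numPartitions) // spread hot keys
        else
          math.floorMod(key.hashCode, numPartitions)           // default hashing
    }

Note that a real Spark Partitioner must map each key to one fixed partition, so spreading a single hot key is normally realized by salting it into sub-keys and aggregating twice; the router above only illustrates the load-balancing intent.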
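For SP-LRP, the abstract names ordinary linear regression but not its features. The sketch below fits a one-variable least-squares model; treating the bytes reported for a partition by early heartbeats as x and the final partition size as y is an assumption made for illustration.

    // One-variable least-squares fit for the partition size predictor.
    // fit returns (slope, intercept) of the model y = slope * x + intercept.
    object PartitionSizePredictor {
      def fit(x: Array[Double], y: Array[Double]): (Double, Double) = {
        require(x.length == y.length && x.length >= 2, "need paired samples")
        val n  = x.length.toDouble
        val mx = x.sum / n
        val my = y.sum / n
        val num = x.indices.map(i => (x(i) - mx) * (y(i) - my)).sum
        val den = x.map(a => (a - mx) * (a - mx)).sum
        require(den != 0.0, "x values must vary")
        val slope = num / den
        (slope, my - slope * mx)
      }

      def predict(model: (Double, Double), xi: Double): Double =
        model._1 * xi + model._2
    }

    // Example: sizes observed early in the job predict the final sizes,
    // which the skew detection model then compares against the average.
    // val model    = PartitionSizePredictor.fit(earlyBytes, finalBytes)
    // val estimate = PartitionSizePredictor.predict(model, newEarlyBytes)

Because the statistics already flow through the heartbeat channel, no extra sampling pass is needed, which matches the abstract's claim that SP-LRP removes the sampling overhead of SP-IRS.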
Keywords/Search Tags:Spark, MapReduce, Data skew, Resource scheduling, High performance computing