
Research on Partition Load Balancing Based on Spark Data Skew

Posted on: 2021-04-02    Degree: Master    Type: Thesis
Country: China    Candidate: Z C Huang    Full Text: PDF
GTID: 2428330602986115    Subject: Electronic and communication engineering
Abstract/Summary:
Data volumes are expanding rapidly in the era of big data. Hadoop and Spark are the big data analysis platforms usually used to process these large amounts of real-time data, and they offer a divide-and-conquer solution. A key issue for this approach to real-time data processing, however, is data skew, which seriously degrades the performance of both platforms. Most existing work addresses data skew on traditional data platforms, and research on data skew in Spark is comparatively scarce. In Spark, the default partitioning algorithm causes data skew after the Shuffle operation when the data is unevenly distributed. Existing remedies distribute the overloaded tasks to additionally split or merged partitions, but these extra operations in turn hurt system performance. This thesis therefore studies the data skew problem in Spark, focusing on how to reduce the total completion time of an application through partition load balancing. A load balancing mechanism based on an improved reservoir sampling algorithm and one based on linear regression partition prediction are proposed. The main work of this thesis is as follows.

(1) To solve Reduce-side data skew under the Spark computing framework, SP-IRS (Spark load balancing mechanism based on an Improved Reservoir Sampling algorithm) is proposed. Compared with existing mechanisms, this algorithm adds a variable weight to the traditional reservoir sampling algorithm in order to predict partition sizes. To make full use of cluster resources, a data skew detection model classifies the data into skewed and non-skewed data, and the skewed data is distributed evenly across the partitions according to the matrix generated by the prediction. The mechanism thus makes the load more balanced.

(2) To further reduce the total completion time of an application, SP-LRP (Spark load balancing mechanism based on Linear Regression Partition prediction) is proposed. This mechanism uses a linear regression prediction algorithm to build a Reduce partition prediction model. Compared with existing mechanisms, it requires no additional sampling operation, so the overall job completion time is reduced. The framework works as follows. First, the partition tracker uses the heartbeat mechanism to collect job runtime information. Second, the runtime statistics are sent to the partition size predictor, which builds a prediction model based on the linear regression algorithm. After the size of each partition is predicted, the data skew detection model identifies the skewed partitions. Finally, the resource allocator derives resource requirements from the identified normal partition sizes.
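The abstract names a weighted variant of reservoir sampling for SP-IRS but does not reproduce its code. The following is a minimal Scala sketch of one standard way to add weights to reservoir sampling, the A-Res scheme of Efraimidis and Spirtakis; using the record's serialized size as the weight is an assumption made here for illustration, not the thesis's definition.

    import scala.util.Random
    import scala.collection.mutable.PriorityQueue

    // Weighted reservoir sampling (A-Res): keep the k records with the
    // largest u^(1/w) keys, where u ~ U(0,1) and w is the record's weight.
    // A larger weight (e.g. record size in bytes) raises the chance of
    // being sampled, so heavy keys are better represented when the sample
    // is used to estimate partition sizes.
    object WeightedReservoir {
      private final case class Entry[T](item: T, key: Double)

      def sample[T](records: Iterator[(T, Double)], k: Int,
                    rng: Random = new Random()): Vector[T] = {
        // Min-heap on key: the head is the weakest of the k kept so far.
        val heap = PriorityQueue.empty[Entry[T]](
          Ordering.by[Entry[T], Double](_.key).reverse)
        for ((item, weight) <- records if weight > 0.0) {
          val key = math.pow(rng.nextDouble(), 1.0 / weight)
          if (heap.size < k) heap.enqueue(Entry(item, key))
          else if (key > heap.head.key) {
            heap.dequeue(); heap.enqueue(Entry(item, key))
          }
        }
        heap.toVector.map(_.item)
      }
    }

Counting the sampled records per reduce key and scaling by the sampling fraction yields per-key load estimates, which is the kind of input the prediction matrix described above would need.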
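The skew detection and redistribution step is likewise only described in prose. A plausible reading, sketched below, classifies a key as skewed when its predicted load exceeds a multiple of the average, then spreads skewed keys round-robin over all partitions while leaving the rest on default hash routing; the threshold factor and the map of predictions are illustrative assumptions, not the thesis's definitions.

    // Hypothetical skew-aware routing sketch. Keys whose predicted load
    // exceeds factor * average are classified as skewed and spread evenly
    // (round-robin) across partitions; non-skewed keys keep hash routing.
    class SkewAwareRouter(numPartitions: Int,
                          predicted: Map[String, Long],
                          factor: Double = 2.0) {
      private val avg =
        if (predicted.isEmpty) 0.0
        else predicted.values.sum.toDouble / predicted.size
      private val skewedKeys: Set[String] =
        predicted.collect { case (k, n) if n > factor * avg => k }.toSet
      private val next = new java.util.concurrent.atomic.AtomicInteger(0)

      def partition(key: String): Int =
        if (skewedKeys.contains(key))
          math.floorMod(next.getAndIncrement(), numPartitions) // spread hot keys
        else
          math.floorMod(key.hashCode, numPartitions)           // default hashing
    }

Note that a real Spark Partitioner must map each key to one fixed partition, so spreading a single hot key is normally realized by salting it into sub-keys and aggregating twice; the router above only illustrates the load-balancing intent.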
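For SP-LRP, the abstract names ordinary linear regression but not its features. The sketch below fits a one-variable least-squares model; treating the bytes reported for a partition by early heartbeats as x and the final partition size as y is an assumption made for illustration.

    // One-variable least-squares fit for the partition size predictor.
    // fit returns (slope, intercept) of the model y = slope * x + intercept.
    object PartitionSizePredictor {
      def fit(x: Array[Double], y: Array[Double]): (Double, Double) = {
        require(x.length == y.length && x.length >= 2, "need paired samples")
        val n  = x.length.toDouble
        val mx = x.sum / n
        val my = y.sum / n
        val num = x.indices.map(i => (x(i) - mx) * (y(i) - my)).sum
        val den = x.map(a => (a - mx) * (a - mx)).sum
        require(den != 0.0, "x values must vary")
        val slope = num / den
        (slope, my - slope * mx)
      }

      def predict(model: (Double, Double), xi: Double): Double =
        model._1 * xi + model._2
    }

    // Example: sizes observed early in the job predict the final sizes,
    // which the skew detection model then compares against the average.
    // val model    = PartitionSizePredictor.fit(earlyBytes, finalBytes)
    // val estimate = PartitionSizePredictor.predict(model, newEarlyBytes)

Because the statistics already flow through the heartbeat channel, no extra sampling pass is needed, which matches the abstract's claim that SP-LRP removes the sampling overhead of SP-IRS.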
Keywords/Search Tags:Spark, MapReduce, Data skew, Resource scheduling, High performance computing