
Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle

Posted on: 2020-05-13    Degree: Master    Type: Thesis
Country: China    Candidate: M L Huang    Full Text: PDF
GTID: 2428330599958994    Subject: Computer technology
Abstract/Summary:
Distributed computing platforms enable efficient processing of massive amounts of data, and Spark, with its in-memory computing model, is widely used in big data research. Data shuffling is an indispensable stage in Spark; if data skew occurs during a Shuffle, it severely degrades the operating efficiency of the entire distributed cluster. Existing dynamic partitioning solutions to the Shuffle data-skew problem suffer from weak dynamic adaptability and coarse granularity. By analyzing the data partitioning principle of Shuffle, this thesis implements SPDB (Spark Partition Dynamic Balanced), a dynamic balanced partitioning method for Spark Shuffle operators. The method works on the Shuffle operators of Resilient Distributed Datasets (RDDs): a single sampling pass estimates the overall skew of the intermediate data, from which a partitioning strategy for the whole application execution is derived. The strategy adjusts the number of partitions and performs balanced partitioning in each execution phase, mitigating the impact of data skew on performance.

In the SPDB method, first, to predict the distribution of the intermediate data, the intermediate data of each Shuffle operation is sampled and preprocessed by reservoir sampling, and the overall skew of the data is estimated. Then, using this estimate, a partitioning decision is made for each Shuffle operator running in the application; dynamic partition adjustment is realized by updating the partition execution plan and the key expansion coefficient of each Shuffle. Furthermore, because the default number of partitions is often unreasonable in data-skew scenarios, a partition-number adjustment algorithm based on the key expansion coefficient is implemented, taking the default partition count and operating parameters into account. Finally, a data-balancing partition algorithm based on the expansion coefficient is designed: keys at different skew levels are partitioned according to their expansion coefficients, which ensures a balanced distribution of data and improves the parallel computing performance of Spark.

Experiments verify the balanced partition optimization of SPDB. The results show that, in data-skew scenarios, the SPDB method generally improves performance by 10% to 40% over default Spark.
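The abstract does not give the exact SPDB algorithms, but the pipeline it describes — reservoir-sample the intermediate keys, estimate per-key skew as an expansion coefficient (observed share of a key relative to the ideal even share per partition), then assign heavy keys first to the lightest partition — can be sketched in standalone Python. The function names, the coefficient definition, and the greedy lightest-partition assignment are illustrative assumptions, not the thesis's actual implementation:

```python
import random
from collections import Counter

def reservoir_sample(stream, k, seed=0):
    """Classic reservoir sampling: a uniform random sample of k items
    drawn in one pass over a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # item i is kept with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def expansion_coefficients(sampled_keys, num_partitions):
    """Per-key skew estimate (hypothetical definition): a key's sampled
    count divided by the ideal per-partition share. A coefficient > 1
    means the key alone would overflow an evenly loaded partition."""
    counts = Counter(sampled_keys)
    ideal = len(sampled_keys) / num_partitions
    return {key: c / ideal for key, c in counts.items()}

def balanced_partition(keys, coeffs, num_partitions):
    """Greedy balanced partitioner: place keys in decreasing order of
    estimated skew, each onto the currently lightest partition."""
    loads = [0.0] * num_partitions
    assignment = {}
    for key in sorted(set(keys), key=lambda k: -coeffs.get(k, 0.0)):
        p = min(range(num_partitions), key=lambda i: loads[i])
        assignment[key] = p
        loads[p] += coeffs.get(key, 0.0)
    return assignment
```

On a skewed key stream (say one key holding 90% of the records), the sampled expansion coefficient of the hot key comes out well above 1, and the greedy pass isolates it from the light keys instead of hashing it together with them, which is the balancing effect the thesis attributes to expansion-coefficient partitioning. Note that this sketch assigns each key to a single partition; the thesis's key expansion coefficient additionally drives splitting a hot key across several partitions, which is omitted here.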
Keywords/Search Tags:Distributed Cluster, Data Skew, Dynamic Balanced Partition Method