
Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle

Posted on: 2020-05-13    Degree: Master    Type: Thesis
Country: China    Candidate: M L Huang    Full Text: PDF
GTID: 2428330599958994    Subject: Computer technology
Abstract/Summary:
Distributed computing platforms enable efficient processing of massive amounts of data, and Spark, with its in-memory computing model, is widely used in big data research. Data shuffling is an indispensable stage in Spark; if data skew occurs during a Shuffle, it severely degrades the operating efficiency of the entire distributed cluster. Existing dynamic partitioning solutions to the Shuffle data-skew problem suffer from weak dynamic adaptability and coarse granularity. By analyzing the data partitioning principle of Shuffle, this thesis implements SPDB (Spark Partition Dynamic Balanced), a dynamic balanced partitioning method for Spark Shuffle operators. The method works on the Shuffle operators of Resilient Distributed Datasets (RDDs): a single sampling pass estimates the overall skew of the intermediate data, from which a partitioning strategy for the whole application execution is derived. The strategy adjusts the number of partitions and performs balanced partitioning in each execution phase, mitigating the impact of data skew on performance.

In the SPDB method, first, to predict the distribution of the intermediate data, the intermediate data of each Shuffle operation is sampled and preprocessed by reservoir sampling, and the overall skew of the data is estimated. Then, using this estimate, a partitioning decision is made for each Shuffle operator running in the application; dynamic partition adjustment is realized by updating the partition execution plan and the key expansion coefficient of each Shuffle. Furthermore, because the default number of partitions is often unreasonable in data-skew scenarios, a partition-number adjustment algorithm based on the key expansion coefficient is implemented, taking the default partition count and operating parameters into account. Finally, a data-balancing partition algorithm based on the expansion coefficient is designed: keys at different skew levels are partitioned according to their expansion coefficients, which ensures a balanced distribution of data and improves the parallel computing performance of Spark.

Experiments verify the balanced partition optimization of SPDB. The results show that, in data-skew scenarios, the SPDB method generally improves performance by 10% to 40% over default Spark.
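The abstract does not give the exact SPDB algorithms, but the pipeline it describes — reservoir-sample the intermediate keys, estimate per-key skew as an expansion coefficient (observed share of a key relative to the ideal even share per partition), then assign heavy keys first to the lightest partition — can be sketched in standalone Python. The function names, the coefficient definition, and the greedy lightest-partition assignment are illustrative assumptions, not the thesis's actual implementation:

```python
import random
from collections import Counter

def reservoir_sample(stream, k, seed=0):
    """Classic reservoir sampling: a uniform random sample of k items
    drawn in one pass over a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # item i is kept with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def expansion_coefficients(sampled_keys, num_partitions):
    """Per-key skew estimate (hypothetical definition): a key's sampled
    count divided by the ideal per-partition share. A coefficient > 1
    means the key alone would overflow an evenly loaded partition."""
    counts = Counter(sampled_keys)
    ideal = len(sampled_keys) / num_partitions
    return {key: c / ideal for key, c in counts.items()}

def balanced_partition(keys, coeffs, num_partitions):
    """Greedy balanced partitioner: place keys in decreasing order of
    estimated skew, each onto the currently lightest partition."""
    loads = [0.0] * num_partitions
    assignment = {}
    for key in sorted(set(keys), key=lambda k: -coeffs.get(k, 0.0)):
        p = min(range(num_partitions), key=lambda i: loads[i])
        assignment[key] = p
        loads[p] += coeffs.get(key, 0.0)
    return assignment
```

On a skewed key stream (say one key holding 90% of the records), the sampled expansion coefficient of the hot key comes out well above 1, and the greedy pass isolates it from the light keys instead of hashing it together with them, which is the balancing effect the thesis attributes to expansion-coefficient partitioning. Note that this sketch assigns each key to a single partition; the thesis's key expansion coefficient additionally drives splitting a hot key across several partitions, which is omitted here.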
Keywords/Search Tags:Distributed Cluster, Data Skew, Dynamic Balanced Partition Method