
Research and Implementation on an Anti-Skew Spark Intermediate Data Partition Mechanism

Posted on: 2020-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: W Lv
Full Text: PDF
GTID: 2428330620951121
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Internet technology, Internet-related products have become increasingly diverse and the number of Internet users continues to rise, so the vast amounts of data generated by these users bring both great opportunities and great challenges to the Internet industry. On the one hand, distributed computing technology and data mining algorithms can extract useful information from the data. On the other hand, the large scale and complex structure of the data cause serious problems during processing. Among them, partition skew is a common performance bottleneck in distributed big data computing. As a widely used distributed big data computing engine, Spark is also plagued by partition skew when running computing tasks. Partition skew usually manifests as some tasks processing much more data than others, which wastes system resources, reduces computational efficiency, and may even cause task execution to fail. Therefore, to ensure the efficient and smooth execution of Spark applications, research on intermediate data partition algorithms is of great importance.

However, previous Spark partition methods are not comprehensive: they consider neither the impact of map-side combine on the data nor the fluctuation in data volume after a shuffle operator executes. To address partition skew in the Spark computing framework, this thesis proposes an intermediate data partition method named SKRSP (Spark-based Key Reassigning and Splitting Partition algorithm). It consists of two parts: intermediate data key distribution prediction, and partition strategy generation and implementation.

The first part, intermediate data key distribution prediction, estimates the frequency of each key. The method first applies a step-based sampling algorithm to selected partitions, and then estimates the key frequencies of the intermediate data while accounting for the effect of map-side combine.

The second part, partition strategy generation and implementation, builds on the results of the first. The thesis first proposes a partition criterion that considers partition balance both before and after shuffle operators: the weight of a key is set to the sum of its intermediate data frequency and its frequency after the shuffle operator, and the partition strategy is then computed from these weights. There are two types of partition strategy. One is a range-based key splitting method for sorting applications, which divides all keys into intervals of equal weight in key order and splits boundary keys across adjacent partitions. The other is a hash-based key reassignment method for non-sorting applications, which first predicts which hash partitions may be skewed, then reassigns part of the keys of each skewed hash partition to other partitions while the remaining keys stay in the original partition. Finally, the partition strategy is applied in the partitioning step of the shuffle phase to achieve partition load balance.

To evaluate the SKRSP partition mechanism, we conduct experiments on a real Spark 2.2.0 cluster. A series of experiments verifies the accuracy and validity of the intermediate data key distribution prediction algorithm, and shows that the partition strategies indeed reduce Spark partition skew and shorten the execution time of reduce-phase tasks, thereby reducing the execution time of the whole application.
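To illustrate the flavor of the first part, step-based sampling and key-frequency estimation could be sketched as below. This is a minimal Python sketch, not the thesis's implementation: the function names, the fixed sampling step, and the uniform scale-up are assumptions, and the map-side combine correction is omitted.

```python
from collections import Counter

def step_sample(partition, step):
    """Take every `step`-th (key, value) record from a map-output partition."""
    return partition[::step]

def estimate_key_frequencies(partitions, step=10):
    """Estimate intermediate-data key frequencies from step-based samples.

    Each sampled count is scaled by `step` to approximate the full count;
    keys that never appear in a sample are simply missed by the estimate.
    """
    counts = Counter()
    for part in partitions:
        counts.update(k for k, _ in step_sample(part, step))
    return {k: c * step for k, c in counts.items()}
```

A heavily skewed key dominates even a sparse sample, which is why sampling only selected partitions can still expose the skewed keys that matter for partitioning.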
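The range-based key splitting strategy for sorting applications could be sketched as follows, under assumed simplifications: keys are assigned in sorted order to partitions with an equal-weight target, and a boundary key whose weight overflows the current partition is split fractionally across adjacent partitions. The function name, the fractional-share representation, and the equal-target rule are assumptions, not the thesis's exact algorithm.

```python
def range_partition(key_weights, num_partitions):
    """Assign sorted keys to contiguous ranges of roughly equal total weight.

    Returns {key: [(partition_id, fraction), ...]}; a key gets more than one
    share only when it sits on a partition boundary and must be split.
    """
    total = sum(key_weights.values())
    target = total / num_partitions          # ideal per-partition load
    assign, pid, load = {}, 0, 0.0
    for key in sorted(key_weights):          # range partitioning needs key order
        remaining = key_weights[key]
        shares = []
        # overflowing keys spill into the next partition(s)
        while pid < num_partitions - 1 and load + remaining > target:
            room = target - load
            if room > 0:
                shares.append((pid, room / key_weights[key]))
                remaining -= room
            pid, load = pid + 1, 0.0
        shares.append((pid, remaining / key_weights[key]))
        load += remaining
        assign[key] = shares
    return assign
```

Splitting a boundary key (rather than rounding it to one side) is what keeps a single very hot key from overloading one reducer while preserving the global key order that sorting applications require.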
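The hash-based key reassignment strategy for non-sorting applications could be sketched in the same spirit. The sketch below moves whole keys out of predicted-skewed hash partitions, whereas the thesis also reallocates part of a skewed partition's keys; the function name, the `threshold` parameter, the greedy least-loaded target choice, and the use of Python's built-in `hash` in place of Spark's hash partitioner are all assumptions.

```python
def rebalance_hash(key_weights, num_partitions, threshold=1.2):
    """Hash-partition keys, then shift keys out of predicted-skewed partitions.

    `key_weights` maps each key to its predicted frequency; a partition is
    predicted skewed when its load exceeds `threshold` times the average.
    """
    loads = [0.0] * num_partitions
    assign = {}
    for k, w in key_weights.items():
        p = hash(k) % num_partitions          # plain hash partitioning first
        assign[k] = p
        loads[p] += w
    avg = sum(loads) / num_partitions
    for p in range(num_partitions):
        # move the lightest keys of an overloaded partition first
        for k in sorted((k for k, part in assign.items() if part == p),
                        key=key_weights.get):
            if loads[p] <= threshold * avg:
                break                         # partition no longer skewed
            q = min(range(num_partitions), key=loads.__getitem__)
            if loads[q] + key_weights[k] >= loads[p]:
                break                         # moving would not improve balance
            assign[k] = q
            loads[p] -= key_weights[k]
            loads[q] += key_weights[k]
    return assign
```

Because only keys of partitions predicted to be skewed are touched, most keys keep their default hash placement, which limits how much the strategy perturbs an application that was already balanced.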
Keywords/Search Tags: Spark, data skew, data partition, intermediate data prediction