Design And Implementation Of Parallel Sorting Algorithms Based On Spark Cluster

Posted on:2022-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:B Wu

Full Text:PDF

GTID:2518306572959849

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

The main challenge of parallel sorting algorithm in distributed cluster is how to make the workload of each node relatively balanced,because the load imbalance is easy to lead to the problem of data skewness,the skew data distribution of the original data set and a large number of replicate keys are the main reasons for this problem.For partition-based parallel sorting algorithms,the core of solving the data skewness problem is that the partition strategy of the splitter can divide the original data set as evenly as possible.In this paper,we summarize six criteria that an efficient splitter must meet when dividing raw data sets.We designed parallel sorting algorithms based on partitions for Bucket Sort,Counting Sort,Pigeonhole Sort,Radix Sort and Sample Sort on the Spark cluster,and analyzed how each algorithm satisfies the six criteria of the splitter partitioning strategy.For the sorting of Sample Sort,we designed two different strategies: percentile splitter strategy and suboptimal splitter strategy.The percentile splitter partitioning strategy focuses on equally dividing the data set,and does not consider the possible existence of non-hot keys in the partition of a hot key.The suboptimal splitter partitioning strategy ensures that hot keys and non-hot keys are not in the same partition through a partition mechanism of equal sets and unequal sets,and ensures that the data set can be evenly divided as far as possible by means of equal sets partitioning,small partition merging and bucket re-balance strategy.The bucket re-balance strategy works by evenly dividing the hot key data into multiple partitions in a round-robin fashion.We tested the algorithms on a 32-node Spark cluster with a total memory capacity of 768 GB.Experimental results show that,for data sets containing a large number of replicate keys,the Sample Sort has the smallest sorting time cost and the most balanced load distribution among the six parallel sorting algorithms implemented and the Sortby operator in Spark.In addition,the Sample Sort using the percentile splitter partitioning strategy has a relatively less sorting time than the Sample Sort using the suboptimal splitter partitioning strategy,but the latter has the characteristic that hot key data and non-hot key data are not in the same partition.

Keywords/Search Tags:

parallel sorting algorithms, replicate keys, splitter partitioning strategy, samplesort, bucket re-balance strategy

PDF Full Text Request

Related items

1	Parallel Algorithms And Parallel Implementation Of Meshless Numerical Simulation
2	Research Of Optimization Strategy For Parallel Graph Partitioning
3	Research On The Optimization Strategy Of Materials Sorting
4	Research On Graph Partitioning Algorithms And The Applications In The Large-scale Numerical Parallel Computation
5	Research On GPU Parallel Algorithm For Knapsack Problem Based On Partitioning Strategy
6	Research On Balance Convergence And Distribution Particle Swarm Optimization Algorithms For Dynamic Multi-objective Problems
7	DELTA Parallel Robot Trajectory Planning And Its Sorting Strategy Research On The Production Line
8	Research And Application Of MemSql Data Partitioning Strategy For Spark
9	Research On Circuit Partitioning Methodology Based On Graph Algorithms
10	Parallel Processing System Loop Partitioning And Data Distribution Technology Research