Font Size: a A A

Design And Implementation Of Parallel Sorting Algorithms Based On Spark Cluster

Posted on:2022-04-29Degree:MasterType:Thesis
Country:ChinaCandidate:B WuFull Text:PDF
GTID:2518306572959849Subject:Computer technology
Abstract/Summary:PDF Full Text Request
The main challenge of parallel sorting algorithm in distributed cluster is how to make the workload of each node relatively balanced,because the load imbalance is easy to lead to the problem of data skewness,the skew data distribution of the original data set and a large number of replicate keys are the main reasons for this problem.For partition-based parallel sorting algorithms,the core of solving the data skewness problem is that the partition strategy of the splitter can divide the original data set as evenly as possible.In this paper,we summarize six criteria that an efficient splitter must meet when dividing raw data sets.We designed parallel sorting algorithms based on partitions for Bucket Sort,Counting Sort,Pigeonhole Sort,Radix Sort and Sample Sort on the Spark cluster,and analyzed how each algorithm satisfies the six criteria of the splitter partitioning strategy.For the sorting of Sample Sort,we designed two different strategies: percentile splitter strategy and suboptimal splitter strategy.The percentile splitter partitioning strategy focuses on equally dividing the data set,and does not consider the possible existence of non-hot keys in the partition of a hot key.The suboptimal splitter partitioning strategy ensures that hot keys and non-hot keys are not in the same partition through a partition mechanism of equal sets and unequal sets,and ensures that the data set can be evenly divided as far as possible by means of equal sets partitioning,small partition merging and bucket re-balance strategy.The bucket re-balance strategy works by evenly dividing the hot key data into multiple partitions in a round-robin fashion.We tested the algorithms on a 32-node Spark cluster with a total memory capacity of 768 GB.Experimental results show that,for data sets containing a large number of replicate keys,the Sample Sort has the smallest sorting time cost and the most balanced load distribution among the six parallel sorting algorithms implemented and the Sortby operator in Spark.In addition,the Sample Sort using the percentile splitter partitioning strategy has a relatively less sorting time than the Sample Sort using the suboptimal splitter partitioning strategy,but the latter has the characteristic that hot key data and non-hot key data are not in the same partition.
Keywords/Search Tags:parallel sorting algorithms, replicate keys, splitter partitioning strategy, samplesort, bucket re-balance strategy
PDF Full Text Request
Related items