
Research On Spark Shuffle Process Performance Optimization

Posted on: 2020-03-04    Degree: Master    Type: Thesis
Country: China    Candidate: T L Zhou    Full Text: PDF
GTID: 2428330590471732    Subject: Computer Science and Technology
Abstract/Summary:
With the development of big data, distributed computing has become increasingly popular, giving rise to many parallel computing frameworks such as Spark, Storm, and Dryad. Unlike MapReduce, Spark keeps intermediate data in memory to reduce the number of I/O operations, so it performs better on iterative jobs. As a MapReduce-like parallel computing framework, Spark also includes a Shuffle process, which connects the Map phase to the Reduce phase. However, the Shuffle process triggers a large amount of network and disk I/O, which directly affects Spark's computing efficiency. This thesis studies two optimization problems in Spark's Shuffle process: network congestion, and the "barrel effect" (the slowest node limiting job progress) in heterogeneous clusters.

1. To preserve the ordering of stages, the Shuffle process requires task synchronization across nodes. The current synchronization mechanism not only wastes cluster computing resources but also causes severe network congestion. To address this, a partial-task-first Shuffle strategy is proposed: ShuffleWrite tasks are generated for the map tasks that have already completed, and their Shuffle output is transferred first. This strategy overlaps data computation with network transmission, lowers the peak network traffic of the Shuffle phase, and balances the cluster's network load during job execution. Experiments show that the partial-task-first strategy improves the execution efficiency of the Shuffle.

2. Spark is unaware of cluster heterogeneity, so the Shuffle process cannot distribute data according to the computing power of each node. To address this, a node performance evaluation model is designed based on an analysis of Spark's working mechanism, and node computing performance is measured by a monitoring module. On this basis, an adaptive data partitioning strategy is proposed: by increasing the number of Buckets, reducing Bucket granularity, and applying a partition allocation algorithm, the amount of Bucket data each Reducer receives after the Shuffle is matched to its computing capacity. Experiments show that the proposed adaptive partitioning strategy balances the job's data load across nodes and reduces job execution time in heterogeneous clusters.

In summary, optimizing Spark's Shuffle process alleviates network congestion, adapts to heterogeneous clusters, improves Shuffle execution efficiency, reduces job execution time, and ultimately improves the overall performance of the Spark computing framework.
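The capacity-proportional partitioning idea in point 2 can be illustrated with a small sketch. The following Scala example is a minimal illustration, not the thesis's actual allocation algorithm: it assumes per-reducer capacity weights (for example, produced by a monitoring module) and assigns keys to reducers in proportion to those weights through a custom Spark Partitioner. The class name CapacityAwarePartitioner and the weight input are hypothetical.

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch: a capacity-aware partitioner. Each reducer has a
// positive weight reflecting its measured computing capacity; keys are
// mapped onto reducers in proportion to those weights, so faster nodes
// receive a larger share of the shuffled data.
class CapacityAwarePartitioner(weights: Array[Double]) extends Partitioner {
  require(weights.nonEmpty && weights.forall(_ > 0.0), "weights must be positive")

  override def numPartitions: Int = weights.length

  // Normalized cumulative shares in (0, 1]; the i-th entry is the upper
  // bound of reducer i's interval on the unit line.
  private val cumulative: Array[Double] = {
    val total = weights.sum
    weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
  }

  override def getPartition(key: Any): Int = {
    val h = key.hashCode & Integer.MAX_VALUE           // non-negative hash
    val u = h.toDouble / Integer.MAX_VALUE.toDouble    // position in [0, 1]
    val idx = cumulative.indexWhere(u <= _)            // first interval containing u
    if (idx >= 0) idx else weights.length - 1          // guard against rounding
  }
}
```

For example, a pair RDD could be repartitioned with `rdd.partitionBy(new CapacityAwarePartitioner(Array(1.0, 2.0, 1.5)))`, which would send roughly twice as many keys to the second reducer as to the first.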
Keywords/Search Tags:Spark shuffle, partial task first, network congestion, heterogeneous cluster, adaptive partition