
Research On Spark Shuffle Process Performance Optimization

Posted on: 2020-03-04    Degree: Master    Type: Thesis
Country: China    Candidate: T L Zhou    Full Text: PDF
GTID: 2428330590471732    Subject: Computer Science and Technology
Abstract/Summary:
With the development of big data, distributed computing has become increasingly popular, giving rise to many parallel computing frameworks such as Spark, Storm, and Dryad. Unlike MapReduce, Spark keeps intermediate data in memory to reduce the number of I/O operations, so it performs better on iterative jobs. As a MapReduce-like parallel computing framework, Spark also includes a Shuffle process, which connects the Map phase to the Reduce phase. However, the Shuffle process triggers a large amount of network and disk I/O, which directly affects Spark's computing efficiency. This thesis studies two optimization problems in Spark's Shuffle process: network congestion, and the "barrel effect" (the slowest node limiting job progress) in heterogeneous clusters.

1. To preserve the ordering of stages, the Shuffle process requires task synchronization across nodes. The current synchronization mechanism not only wastes cluster computing resources but also causes severe network congestion. To address this, a partial-task-first Shuffle strategy is proposed: ShuffleWrite tasks are generated for the map tasks that have already completed, and their Shuffle output is transferred first. This strategy overlaps data computation with network transmission, lowers the peak network traffic of the Shuffle phase, and balances the cluster's network load during job execution. Experiments show that the partial-task-first strategy improves the execution efficiency of the Shuffle.

2. Spark is unaware of cluster heterogeneity, so the Shuffle process cannot distribute data according to the computing power of each node. To address this, a node performance evaluation model is designed based on an analysis of Spark's working mechanism, and node computing performance is measured by a monitoring module. On this basis, an adaptive data partitioning strategy is proposed: by increasing the number of Buckets, reducing Bucket granularity, and applying a partition allocation algorithm, the amount of Bucket data each Reducer receives after the Shuffle is matched to its computing capacity. Experiments show that the proposed adaptive partitioning strategy balances the job's data load across nodes and reduces job execution time in heterogeneous clusters.

In summary, optimizing Spark's Shuffle process alleviates network congestion, adapts to heterogeneous clusters, improves Shuffle execution efficiency, reduces job execution time, and ultimately improves the overall performance of the Spark computing framework.
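The capacity-proportional partitioning idea in point 2 can be illustrated with a small sketch. The following Scala example is a minimal illustration, not the thesis's actual allocation algorithm: it assumes per-reducer capacity weights (for example, produced by a monitoring module) and assigns keys to reducers in proportion to those weights through a custom Spark Partitioner. The class name CapacityAwarePartitioner and the weight input are hypothetical.

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch: a capacity-aware partitioner. Each reducer has a
// positive weight reflecting its measured computing capacity; keys are
// mapped onto reducers in proportion to those weights, so faster nodes
// receive a larger share of the shuffled data.
class CapacityAwarePartitioner(weights: Array[Double]) extends Partitioner {
  require(weights.nonEmpty && weights.forall(_ > 0.0), "weights must be positive")

  override def numPartitions: Int = weights.length

  // Normalized cumulative shares in (0, 1]; the i-th entry is the upper
  // bound of reducer i's interval on the unit line.
  private val cumulative: Array[Double] = {
    val total = weights.sum
    weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
  }

  override def getPartition(key: Any): Int = {
    val h = key.hashCode & Integer.MAX_VALUE           // non-negative hash
    val u = h.toDouble / Integer.MAX_VALUE.toDouble    // position in [0, 1]
    val idx = cumulative.indexWhere(u <= _)            // first interval containing u
    if (idx >= 0) idx else weights.length - 1          // guard against rounding
  }
}
```

For example, a pair RDD could be repartitioned with `rdd.partitionBy(new CapacityAwarePartitioner(Array(1.0, 2.0, 1.5)))`, which would send roughly twice as many keys to the second reducer as to the first.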
Keywords/Search Tags:Spark shuffle, partial task first, network congestion, heterogeneous cluster, adaptive partition