
Research On Shuffle Mechanism In Spark Cluster

Posted on: 2018-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Xia
Full Text: PDF
GTID: 2348330569986438
Subject: Computer Science and Technology
Abstract/Summary:
Apache Spark is a fast, general engine for large-scale data processing and has become a Top-Level Apache Project. Through efficient in-memory computation, Spark handles two classes of applications that existing computing frameworks process inefficiently: iterative algorithms and interactive data mining tools. As with other distributed data processing platforms, data is commonly collected in a many-to-many fashion, a stage traditionally known as the shuffle phase. In Spark, the shuffle phase contains many sources of inefficiency that, once addressed, promise substantial performance improvements. The existing shuffle strategy suffers from long intermediate-data transfer times and noticeable network overhead. Based on the distribution of the intermediate data, this paper proposes a locality-based partitioning strategy that reduces the amount of data transferred during the shuffle and improves overall shuffle performance. The main work of this paper is as follows:

1. To address the problem that map-task output is too concentrated, causing uneven load across nodes, this paper proposes an optimized shuffle scheduling strategy. During the map phase, a small number of tasks are launched according to their locality level so that their output is dispersed across the cluster; during the shuffle fetch phase, data sources are selected according to each node's network load. Compared with the existing scheduling strategies, experiments show that the balanced scheduling strategy improves shuffle efficiency and reduces the average task execution time.

2. To address data skew during the shuffle, this paper proposes a locality-aware partitioning strategy that saves network bandwidth during the shuffle phase and balances the reducers' inputs. Compared with the hash partitioning strategy, the results show that the proposed method reduces the amount of data transferred during the shuffle and balances the input size of each reduce task (a minimal sketch of such a partitioner appears below).

In summary, the shuffle process suffers from problems such as uneven network load across nodes and data skew. Based on task locality levels, this paper proposes a map-task restarting strategy and a locality-based partitioning strategy, which significantly alleviate the network load and data skew.
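To make the locality-aware partitioning idea concrete, the following is a minimal sketch in Scala of a custom Spark Partitioner. The pre-computed preferredPartition map (key to the partition whose node already holds most of that key's map-side output) is an illustrative assumption, not the thesis's exact implementation, which derives placement from the measured distribution of the intermediate data; keys without a preferred location fall back to ordinary hash partitioning.

    import org.apache.spark.Partitioner

    // Sketch of a locality-aware partitioner. `preferredPartition` is assumed
    // to be built beforehand (e.g. by sampling map output), mapping each hot
    // key to the partition co-located with most of its intermediate data.
    class LocalityAwarePartitioner(
        numParts: Int,
        preferredPartition: Map[String, Int]) extends Partitioner {

      override def numPartitions: Int = numParts

      override def getPartition(key: Any): Int = {
        val k = key.toString
        // Route keys with a known preferred location to that partition, so the
        // reduce task runs where most of the data already resides; fall back
        // to hash partitioning for all other keys (floorMod avoids negatives).
        preferredPartition.getOrElse(k,
          java.lang.Math.floorMod(k.hashCode, numParts))
      }
    }

Such a partitioner can be plugged into any shuffle-producing operation, e.g. pairs.reduceByKey(new LocalityAwarePartitioner(numReducers, prefs), _ + _), replacing the default HashPartitioner's routing without changing application logic.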
Keywords/Search Tags: Spark, shuffle, locality, data skew, partition