
Research On Shuffle Mechanism In Spark Cluster

Posted on: 2018-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Xia
Full Text: PDF
GTID: 2348330569986438
Subject: Computer Science and Technology
Abstract/Summary:
Apache Spark is a fast, general engine for large-scale data processing and has become a Top-Level Apache Project. Through efficient in-memory computation, Spark handles two classes of applications that existing computing frameworks process inefficiently: iterative algorithms and interactive data mining tools. As with other distributed data processing platforms, data is commonly collected in a many-to-many fashion, a stage traditionally known as the shuffle phase. In Spark, the shuffle phase contains many sources of inefficiency that, once addressed, promise substantial performance improvements. The existing shuffle strategy suffers from long intermediate-data transfer times and noticeable network overhead. Based on the distribution of the intermediate data, this paper proposes a locality-based partitioning strategy that reduces the amount of data transferred during the shuffle and improves overall shuffle performance. The main work of this paper is as follows:

1. To address the problem that map-task output is too concentrated, causing uneven load across nodes, this paper proposes an optimized shuffle scheduling strategy. During the map phase, a small number of tasks are launched according to their locality level so that their output is dispersed across the cluster; during the shuffle fetch phase, data sources are selected according to each node's network load. Compared with the existing scheduling strategies, experiments show that the balanced scheduling strategy improves shuffle efficiency and reduces the average task execution time.

2. To address data skew during the shuffle, this paper proposes a locality-aware partitioning strategy that saves network bandwidth during the shuffle phase and balances the reducers' inputs. Compared with the hash partitioning strategy, the results show that the proposed method reduces the amount of data transferred during the shuffle and balances the input size of each reduce task (a minimal sketch of such a partitioner appears below).

In summary, the shuffle process suffers from problems such as uneven network load across nodes and data skew. Based on task locality levels, this paper proposes a map-task restarting strategy and a locality-based partitioning strategy, which significantly alleviate the network load and data skew.
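To make the locality-aware partitioning idea concrete, the following is a minimal sketch in Scala of a custom Spark Partitioner. The pre-computed preferredPartition map (key to the partition whose node already holds most of that key's map-side output) is an illustrative assumption, not the thesis's exact implementation, which derives placement from the measured distribution of the intermediate data; keys without a preferred location fall back to ordinary hash partitioning.

    import org.apache.spark.Partitioner

    // Sketch of a locality-aware partitioner. `preferredPartition` is assumed
    // to be built beforehand (e.g. by sampling map output), mapping each hot
    // key to the partition co-located with most of its intermediate data.
    class LocalityAwarePartitioner(
        numParts: Int,
        preferredPartition: Map[String, Int]) extends Partitioner {

      override def numPartitions: Int = numParts

      override def getPartition(key: Any): Int = {
        val k = key.toString
        // Route keys with a known preferred location to that partition, so the
        // reduce task runs where most of the data already resides; fall back
        // to hash partitioning for all other keys (floorMod avoids negatives).
        preferredPartition.getOrElse(k,
          java.lang.Math.floorMod(k.hashCode, numParts))
      }
    }

Such a partitioner can be plugged into any shuffle-producing operation, e.g. pairs.reduceByKey(new LocalityAwarePartitioner(numReducers, prefs), _ + _), replacing the default HashPartitioner's routing without changing application logic.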
Keywords/Search Tags: Spark, shuffle, locality, data skew, partition