
Research On Task Execution Optimization In Spark

Posted on: 2020-06-13    Degree: Master    Type: Thesis
Country: China    Candidate: M Y Du    Full Text: PDF
GTID: 2428330623459876    Subject: Computer Science and Technology
Abstract/Summary:
As a memory-based distributed computing framework, Spark has been widely used in big data processing systems. Spark adopts the Hadoop MapReduce computing model, but it uses RDDs for data processing, effectively avoiding a large number of disk I/O operations during computation and improving system performance. However, Spark still has shortcomings in the Shuffle phase: (1) the Partition skew problem in the Shuffle Write phase; (2) the node load skew problem in the Shuffle Read phase. In view of these shortcomings, this dissertation studies task execution optimization techniques for the Spark Shuffle phase.

Firstly, for the Partition skew problem in the Shuffle Write phase, this dissertation proposes a balanced Spark data partitioner called BSPartitioner (Balanced Spark Partitioner). By analyzing the data partitioning characteristics of the Shuffle stage in depth, a balanced partitioning model for the intermediate data is established. The model aims to minimize the skew among Partitions and to find a balanced partitioning strategy for the intermediate data of the Shuffle stage. Based on this model, the balanced data partitioning algorithm of BSPartitioner is designed and implemented. The algorithm transforms the balanced partitioning problem into the classic List-Scheduling task scheduling problem, effectively achieving balanced partitioning of the Shuffle intermediate data and improving the execution efficiency of Spark.

Secondly, for the node load skew problem in the Shuffle Read phase, this dissertation proposes a cost-based Shuffle Read Partition placement algorithm called SPOC (Shuffle Partition Placement Based on Cost). By transforming the Partition placement problem into a node load balancing problem, a balanced node load model is established. The model aims to minimize the maximum load across nodes by finding a proper Partition placement strategy. Based on this model, the SPOC algorithm uses a two-stage optimization method to obtain a suitable Partition placement strategy for the Shuffle Read phase, which achieves load balancing between nodes and further improves the execution efficiency of Spark.

Finally, the open-source Spark computing system was extended based on the research work of this dissertation. A Spark computing cluster was built, and the benchmark tool TPC-D was used to generate experimental data with different degrees of data skew and different data volumes for experimental analysis. The feasibility and effectiveness of the research work in this dissertation are thereby verified.
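Both BSPartitioner and SPOC ultimately reduce to a min-max load-balancing assignment, which the abstract relates to the classic List-Scheduling problem. The Scala sketch below is a minimal, hypothetical illustration of that greedy idea (take items in decreasing size order and place each on the currently least-loaded bucket); GreedyBalancer and all names in it are illustrative only and are not the thesis's algorithms or Spark's API.

    // Hypothetical sketch (not the thesis's implementation or a Spark API):
    // greedy List-Scheduling assignment. A "bucket" stands for a reducer
    // Partition in the Shuffle Write case and a worker node in the Shuffle
    // Read case; the goal is to keep the maximum bucket load small.
    object GreedyBalancer {

      def assign(itemSizes: Seq[(String, Long)], numBuckets: Int): Map[String, Int] = {
        require(numBuckets > 0, "need at least one bucket")
        val loads     = Array.fill(numBuckets)(0L)                        // current load per bucket
        val placement = scala.collection.mutable.Map.empty[String, Int]   // item -> bucket index

        for ((key, size) <- itemSizes.sortBy { case (_, s) => -s }) {     // largest items first (LPT order)
          val target = loads.indices.minBy(i => loads(i))                 // least-loaded bucket so far
          placement(key) = target
          loads(target) += size
        }
        placement.toMap
      }
    }

    // Illustrative use (e.g. in a Scala script or REPL): balancing six skewed
    // key-group sizes across three buckets yields bucket loads of 90 / 80 / 80.
    val plan = GreedyBalancer.assign(
      Seq("k1" -> 90L, "k2" -> 60L, "k3" -> 40L, "k4" -> 30L, "k5" -> 20L, "k6" -> 10L),
      numBuckets = 3)

The same greedy skeleton can be read in two ways: with Shuffle Write key groups as items and Partitions as buckets it mirrors the balanced-partitioning goal of BSPartitioner, and with Partitions as items and nodes as buckets it mirrors the min-max placement goal of SPOC, though the thesis's actual algorithms use their own cost models and a two-stage optimization rather than this bare greedy pass.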
Keywords/Search Tags:Hadoop, Spark, Shuffle, Skew partition, Cost-based optimization