
Research On Task Execution Optimization In Spark

Posted on: 2020-06-13    Degree: Master    Type: Thesis
Country: China    Candidate: M Y Du    Full Text: PDF
GTID: 2428330623459876    Subject: Computer Science and Technology
Abstract/Summary:
As a memory-based distributed computing framework, Spark has been widely used in big data processing systems. Spark adopts the Hadoop MapReduce computing model, but it uses RDDs for data processing, effectively avoiding a large number of disk I/O operations during computation and improving system performance. However, Spark still has shortcomings in the Shuffle phase: (1) the Partition skew problem in the Shuffle Write phase; (2) the node load skew problem in the Shuffle Read phase. In view of these shortcomings, this dissertation studies task execution optimization techniques for the Spark Shuffle phase.

Firstly, for the Partition skew problem in the Shuffle Write phase, this dissertation proposes a balanced Spark data partitioner called BSPartitioner (Balanced Spark Partitioner). By analyzing the data partitioning characteristics of the Shuffle stage in depth, a balanced partitioning model for the intermediate data is established. The model aims to minimize the skew among Partitions and to find a balanced partitioning strategy for the intermediate data of the Shuffle stage. Based on this model, the balanced data partitioning algorithm of BSPartitioner is designed and implemented. The algorithm transforms the balanced partitioning problem into the classic List-Scheduling task scheduling problem, effectively achieving balanced partitioning of the Shuffle intermediate data and improving the execution efficiency of Spark.

Secondly, for the node load skew problem in the Shuffle Read phase, this dissertation proposes a cost-based Shuffle Read Partition placement algorithm called SPOC (Shuffle Partition Placement Based on Cost). By transforming the Partition placement problem into a node load balancing problem, a balanced node load model is established. The model aims to minimize the maximum load across nodes by finding a proper Partition placement strategy. Based on this model, the SPOC algorithm uses a two-stage optimization method to obtain a suitable Partition placement strategy for the Shuffle Read phase, which achieves load balancing between nodes and further improves the execution efficiency of Spark.

Finally, the open-source Spark computing system was extended based on the research work of this dissertation. A Spark computing cluster was built, and the benchmark tool TPC-D was used to generate experimental data with different degrees of data skew and different data volumes for experimental analysis. The feasibility and effectiveness of the research work in this dissertation are thereby verified.
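Both BSPartitioner and SPOC ultimately reduce to a min-max load-balancing assignment, which the abstract relates to the classic List-Scheduling problem. The Scala sketch below is a minimal, hypothetical illustration of that greedy idea (take items in decreasing size order and place each on the currently least-loaded bucket); GreedyBalancer and all names in it are illustrative only and are not the thesis's algorithms or Spark's API.

    // Hypothetical sketch (not the thesis's implementation or a Spark API):
    // greedy List-Scheduling assignment. A "bucket" stands for a reducer
    // Partition in the Shuffle Write case and a worker node in the Shuffle
    // Read case; the goal is to keep the maximum bucket load small.
    object GreedyBalancer {

      def assign(itemSizes: Seq[(String, Long)], numBuckets: Int): Map[String, Int] = {
        require(numBuckets > 0, "need at least one bucket")
        val loads     = Array.fill(numBuckets)(0L)                        // current load per bucket
        val placement = scala.collection.mutable.Map.empty[String, Int]   // item -> bucket index

        for ((key, size) <- itemSizes.sortBy { case (_, s) => -s }) {     // largest items first (LPT order)
          val target = loads.indices.minBy(i => loads(i))                 // least-loaded bucket so far
          placement(key) = target
          loads(target) += size
        }
        placement.toMap
      }
    }

    // Illustrative use (e.g. in a Scala script or REPL): balancing six skewed
    // key-group sizes across three buckets yields bucket loads of 90 / 80 / 80.
    val plan = GreedyBalancer.assign(
      Seq("k1" -> 90L, "k2" -> 60L, "k3" -> 40L, "k4" -> 30L, "k5" -> 20L, "k6" -> 10L),
      numBuckets = 3)

The same greedy skeleton can be read in two ways: with Shuffle Write key groups as items and Partitions as buckets it mirrors the balanced-partitioning goal of BSPartitioner, and with Partitions as items and nodes as buckets it mirrors the min-max placement goal of SPOC, though the thesis's actual algorithms use their own cost models and a two-stage optimization rather than this bare greedy pass.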
Keywords/Search Tags:Hadoop, Spark, Shuffle, Skew partition, Cost-based optimization