Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology

Posted on:2021-04-22

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z M Fu

Full Text:PDF

GTID:1488306122979889

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of computer technology, the volume of data produced across all sectors has grown explosively. The era of big data has arrived, bringing great opportunities and challenges to the Internet industry. On the one hand, big data resources hold enormous social and commercial value: managing these data effectively and mining their deeper value will profoundly affect national governance, social management, corporate decision-making, and personal life. On the other hand, big data is characterized by the 5Vs: volume, velocity, variety, veracity, and value. Traditional data processing systems and technologies struggle to meet the demands of big data processing. Parallel processing is currently an effective way to process large amounts of data, and MapReduce has developed into a standard parallel programming model. As one of the popular open-source implementations of the MapReduce framework, Spark offers high efficiency, scalability, fault tolerance, and ease of use. It has received great attention from academia and is widely used in industry. Although Spark's in-memory computing provides more powerful computing capabilities than Hadoop, it still suffers from performance bottlenecks in practice. Improving Spark's performance when processing big data is therefore an urgent problem. In view of this, this paper studies performance optimization of the in-memory Spark distributed computing framework from four aspects: fault tolerance, task scheduling, data communication, and task load balancing. The main work and innovations of this paper are as follows.

First, in terms of fault tolerance, an improved speculative execution strategy for heterogeneous environments is proposed to solve the straggler problem. Because of inherent defects, Spark's original speculative execution strategy cannot solve this problem effectively and can even degrade performance in a heterogeneous environment. This paper focuses on three key problems of speculative execution in heterogeneous environments: straggler identification, backup node selection, and guaranteeing the effectiveness of speculative tasks. In addition, to minimize straggler misjudgment, the influence of data locality and data skew is taken into account. Performance is evaluated on a Spark cluster using micro-benchmarks (Sort and WordCount), macro-benchmarks (K-means and LDA), and HiBench. Experimental results show that the proposed strategy raises straggler identification accuracy to 80% and the recall rate to more than 90%, and reduces the average search time by more than 60 seconds.

Second, in terms of task scheduling, a locality-aware task scheduling algorithm is proposed. Spark's task scheduler uses a greedy scheduling strategy that ignores the interaction between task placements, which yields only locally optimal data locality. To handle the different communication patterns of the Map and Reduce stages, this paper uses a bipartite graph to model Map and Reduce task scheduling uniformly; a scheduling scheme that minimizes the total communication cost is then formulated and converted into a graph problem to be solved. Performance is evaluated on a Spark cluster using micro-benchmarks (WordCount and Join), macro-benchmarks (PageRank and LDA), and HiBench. Experimental results show that, compared with other algorithms, the proposed task scheduling algorithm can reduce job execution time by 35% and network traffic by 38%.

Third, in terms of data communication, an executor allocation method is proposed to optimize the total communication distance. Spark provides two executor allocation methods, SpreadOut and NoSpreadOut, both of which may lead to long data transmission distances between tasks. We compute the executor distance matrix and formulate an executor allocation scheme that minimizes the total communication distance. Then, for the cases where the distances between executors do and do not satisfy the triangle inequality, an approximation algorithm for optimal executor allocation and an executor set expansion algorithm are proposed, respectively. Performance is evaluated on a Spark cluster using micro-benchmarks (Sort and Join) and macro-benchmarks (PageRank and LDA). Experimental results show that the proposed algorithms can reduce the data communication delay of tasks by 24% to 45%.

Fourth, in terms of task load balancing, an adaptive intermediate data partitioning method is proposed to even out data partitions in the shuffle stage. The hash partitioner and range partitioner provided by Spark can easily cause load imbalance among reduce tasks, which particularly affects job performance in Spark Streaming. This paper estimates the intermediate data distribution of the next batch of jobs from previously processed micro-batches. Then, to address uneven intermediate data distribution, a series of optimizations based on the range partitioning scheme are proposed, with particular attention to partition balance before and after the shuffle operation. Performance is evaluated on a Spark cluster using micro-benchmarks (WordCount and Sort) and macro-benchmarks (PageRank and LDA). Experimental results show that the proposed partitioning method can balance the load of Reduce tasks.

The work in this paper has great theoretical and practical value. Particularly in the context of big data, improving the performance of the Spark framework and making full use of the parallel processing capabilities of big data frameworks are of great practical significance for improving the performance of applications that process massive data.
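The idea behind distribution-aware range partitioning, as described for the load-balancing contribution above, can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the function names `balanced_boundaries` and `partition_of` are hypothetical, and the sketch assumes that the keys seen in earlier micro-batches are available as a plain list and that their frequencies approximate the next batch. Range boundaries are placed where the cumulative sampled frequency crosses equal-share targets, so a frequent key gets a narrow range and rare keys share a wide one.

```python
from bisect import bisect_left
from collections import Counter


def balanced_boundaries(sample_keys, num_partitions):
    """Pick range upper bounds so each range holds a similar share of the sample.

    Hypothetical helper: sample_keys stands in for keys observed in
    previously processed micro-batches.
    """
    counts = Counter(sample_keys)
    target = sum(counts.values()) / num_partitions  # ideal records per partition
    boundaries = []
    running = 0
    for key in sorted(counts):
        running += counts[key]
        # Close the current range once the cumulative count reaches the next target.
        if (len(boundaries) < num_partitions - 1
                and running >= target * (len(boundaries) + 1)):
            boundaries.append(key)
    return boundaries


def partition_of(key, boundaries):
    """Keys <= boundaries[i] (and above boundaries[i-1]) go to partition i."""
    return bisect_left(boundaries, key)


# Usage: key 1 is four times as frequent as any other key.
sample = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
bounds = balanced_boundaries(sample, 3)
loads = Counter(partition_of(k, bounds) for k in sample)
print(bounds, dict(loads))
```

In this example the hot key 1 receives a partition of its own, and each of the three partitions ends up with four records. A single key that is heavier than one partition's share cannot be split by range boundaries alone, which is one reason further optimization measures beyond plain range partitioning are needed.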

Keywords/Search Tags:

MapReduce, Spark, Speculative execution, Data skew, Communication delay, Load balancing

Related items

1	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
2	A Research Of Load Balancing Algorithms For Data Skew In Spark
3	Load Balancing Algorithm Based On Data Skew Of MapReduce
4	Research Of Data Skew On Spark Based On Improved Partition Method
5	Study On Performance Optimization Of MapReduce
6	The Research Of Load Balancing In Mapreduce Based On Sampling Estimation
7	The Research Of Skew With Sampling Technique In MapReduce
8	Research On Lightweight Load Balancing Under Mapreduce
9	Research And Strategy On Data Skew Problem Based On MapReduce
10	Research On Partition Loading Balance Based On Spark Data Skew