
Research On Spark Dual-phase Pipeline Task Scheduling Model Based On Network Load Perception

Posted on: 2022-05-01    Degree: Master    Type: Thesis
Country: China    Candidate: K L He    Full Text: PDF
GTID: 2518306731487974    Subject: Computer Science and Technology
Abstract/Summary:
In today's era, data is growing explosively. To meet the data-processing challenges of the big data era, a new class of massive data processing platforms has emerged, built on the principle that moving computation to the data is cheaper than moving data to the computation; Spark is a representative example. Spark is a widely used parallel processing framework, and task scheduling has a large impact on the performance of Spark clusters. Although task scheduling is an NP-complete problem, many scholars have proposed heuristic rules to obtain approximately optimal solutions. Most of them, however, ignore the dynamic nature of resource requirements during task execution, which leads to incomplete and unbalanced resource use and therefore poor cluster performance.

Considered over a task's entire lifetime, CPU utilization is often low during data transfer. In most distributed data processing platforms, data transmission is time-consuming, which usually results in low overall CPU utilization; similarly, network throughput is low while a task is computing. Based on this observation, this paper proposes a heuristic task scheduling algorithm that perceives changes in network load and, on that basis, implements a dual-phase pipeline task scheduler (D2PTS) that treats resource requirements as dynamic and aims to maximize cluster resource utilization, as a supplement to Spark's existing scheduling mechanism. In detail, the main contributions of this paper are:

1) D2PTS divides task execution into two phases according to the task's network load: network-intensive (network needed) and network-free (no network needed). To improve overall resource utilization, this paper proposes separate algorithms to estimate the execution time of tasks in the network-intensive phase and the network-free phase.

2) This paper implements a dual-phase pipeline task scheduler. When the currently executing task is in its network-free phase, D2PTS can, at an appropriate time, additionally schedule a new network-intensive task for execution. Under this strategy, two tasks sharing the same CPU core execute in a coarse-grained pipeline mode, which starts tasks earlier and evens out resource utilization.

3) Finally, a D2PTS prototype was implemented on the open-source platform Spark 2.4.3. Using the benchmark suite HiBench, various workloads were selected for an experimental performance analysis of the proposed task scheduling model. Experimental results show that, compared with Spark's default scheduling strategy, D2PTS not only reduces application execution time (by 10% on average) but also improves resource utilization.
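To make the coarse-grained pipeline idea in contribution 2) concrete, the following is a minimal Scala sketch of the scheduling rule it describes: a waiting network-intensive task may share a CPU core only when the resident task has entered its network-free (CPU-only) phase, so network I/O overlaps with computation. All names here (Phase, Task, CoreState, canPipeline, pickCompanion) and the best-fit heuristic are illustrative assumptions, not the thesis's actual D2PTS implementation or any Spark API.

```scala
// Minimal sketch of the dual-phase pipeline scheduling rule (assumed names).
object DualPhasePipelineSketch {

  sealed trait Phase
  case object NetworkIntensive extends Phase // e.g. fetching remote data
  case object NetworkFree      extends Phase // pure CPU computation

  // A task with rough per-phase time estimates (seconds), as D2PTS's
  // phase-time estimation algorithms would provide.
  final case class Task(id: Int, netTime: Double, cpuTime: Double)

  // State of one CPU core: the resident task and its current phase, if any.
  final case class CoreState(running: Option[(Task, Phase)])

  // Co-scheduling is allowed only once the resident task is network-free,
  // so the companion's network phase does not contend for the CPU.
  def canPipeline(core: CoreState): Boolean =
    core.running.exists { case (_, phase) => phase == NetworkFree }

  // Pick the waiting task whose network phase fits best inside the resident
  // task's remaining CPU time (a simple, assumed best-fit heuristic).
  def pickCompanion(core: CoreState, waiting: Seq[Task]): Option[Task] =
    core.running.flatMap { case (resident, _) =>
      waiting
        .filter(_.netTime <= resident.cpuTime)            // must finish fetching before CPU phase ends
        .sortBy(t => resident.cpuTime - t.netTime)        // tightest fit first
        .headOption
    }

  def main(args: Array[String]): Unit = {
    val core    = CoreState(Some((Task(1, netTime = 4.0, cpuTime = 10.0), NetworkFree)))
    val waiting = Seq(Task(2, 12.0, 3.0), Task(3, 8.0, 5.0), Task(4, 2.0, 6.0))

    if (canPipeline(core))
      println(s"co-schedule: ${pickCompanion(core, waiting)}") // picks Task(3): net phase 8s fits in 10s
  }
}
```

Under these assumptions, the two tasks form a two-stage pipeline on one core: the resident task computes while the companion task transfers data, which is the overlap the abstract credits for the reported reduction in execution time.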
Keywords/Search Tags: big data, dynamic resource requirements, dual-phase pipeline, task scheduling