
Research On Spark Dual-phase Pipeline Task Scheduling Model Based On Network Load Perception

Posted on: 2022-05-01    Degree: Master    Type: Thesis
Country: China    Candidate: K L He    Full Text: PDF
GTID: 2518306731487974    Subject: Computer Science and Technology
Abstract/Summary:
In today's era, data is growing explosively. To meet the data-processing challenges of the big data era, a new class of massive data processing platforms has emerged, built on the principle that moving computation to the data is cheaper than moving data to the computation; Spark is a representative example. Spark is a widely used parallel processing framework, and task scheduling has a large impact on the performance of Spark clusters. Although task scheduling is an NP-complete problem, many scholars have proposed heuristic rules to obtain approximately optimal solutions. Most of them, however, ignore the dynamic nature of resource requirements during task execution, which leads to incomplete and unbalanced resource use and therefore poor cluster performance.

Considered over a task's entire lifetime, CPU utilization is often low during data transfer. In most distributed data processing platforms, data transmission is time-consuming, which usually results in low overall CPU utilization; similarly, network throughput is low while a task is computing. Based on this observation, this paper proposes a heuristic task scheduling algorithm that perceives changes in network load and, on that basis, implements a dual-phase pipeline task scheduler (D2PTS) that treats resource requirements as dynamic and aims to maximize cluster resource utilization, as a supplement to Spark's existing scheduling mechanism. In detail, the main contributions of this paper are:

1) D2PTS divides task execution into two phases according to the task's network load: network-intensive (network needed) and network-free (no network needed). To improve overall resource utilization, this paper proposes separate algorithms to estimate the execution time of tasks in the network-intensive phase and the network-free phase.

2) This paper implements a dual-phase pipeline task scheduler. When the currently executing task is in its network-free phase, D2PTS can, at an appropriate time, additionally schedule a new network-intensive task for execution. Under this strategy, two tasks sharing the same CPU core execute in a coarse-grained pipeline mode, which starts tasks earlier and evens out resource utilization.

3) Finally, a D2PTS prototype was implemented on the open-source platform Spark 2.4.3. Using the benchmark suite HiBench, various workloads were selected for an experimental performance analysis of the proposed task scheduling model. Experimental results show that, compared with Spark's default scheduling strategy, D2PTS not only reduces application execution time (by 10% on average) but also improves resource utilization.
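To make the coarse-grained pipeline idea in contribution 2) concrete, the following is a minimal Scala sketch of the scheduling rule it describes: a waiting network-intensive task may share a CPU core only when the resident task has entered its network-free (CPU-only) phase, so network I/O overlaps with computation. All names here (Phase, Task, CoreState, canPipeline, pickCompanion) and the best-fit heuristic are illustrative assumptions, not the thesis's actual D2PTS implementation or any Spark API.

```scala
// Minimal sketch of the dual-phase pipeline scheduling rule (assumed names).
object DualPhasePipelineSketch {

  sealed trait Phase
  case object NetworkIntensive extends Phase // e.g. fetching remote data
  case object NetworkFree      extends Phase // pure CPU computation

  // A task with rough per-phase time estimates (seconds), as D2PTS's
  // phase-time estimation algorithms would provide.
  final case class Task(id: Int, netTime: Double, cpuTime: Double)

  // State of one CPU core: the resident task and its current phase, if any.
  final case class CoreState(running: Option[(Task, Phase)])

  // Co-scheduling is allowed only once the resident task is network-free,
  // so the companion's network phase does not contend for the CPU.
  def canPipeline(core: CoreState): Boolean =
    core.running.exists { case (_, phase) => phase == NetworkFree }

  // Pick the waiting task whose network phase fits best inside the resident
  // task's remaining CPU time (a simple, assumed best-fit heuristic).
  def pickCompanion(core: CoreState, waiting: Seq[Task]): Option[Task] =
    core.running.flatMap { case (resident, _) =>
      waiting
        .filter(_.netTime <= resident.cpuTime)            // must finish fetching before CPU phase ends
        .sortBy(t => resident.cpuTime - t.netTime)        // tightest fit first
        .headOption
    }

  def main(args: Array[String]): Unit = {
    val core    = CoreState(Some((Task(1, netTime = 4.0, cpuTime = 10.0), NetworkFree)))
    val waiting = Seq(Task(2, 12.0, 3.0), Task(3, 8.0, 5.0), Task(4, 2.0, 6.0))

    if (canPipeline(core))
      println(s"co-schedule: ${pickCompanion(core, waiting)}") // picks Task(3): net phase 8s fits in 10s
  }
}
```

Under these assumptions, the two tasks form a two-stage pipeline on one core: the resident task computes while the companion task transfers data, which is the overlap the abstract credits for the reported reduction in execution time.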
Keywords/Search Tags: big data, dynamic resource requirements, dual-phase pipeline, task scheduling