Task Scheduling For Spark Application With Data Affinity In Heterogeneous Cluster

Posted on:2021-03-28

Degree:Master

Type:Thesis

Country:China

Candidate:H A Du

GTID:2518306557489634

Subject:Software engineering

Abstract/Summary:

The task scheduling strategies that Spark currently provides assume a homogeneous environment. However, heterogeneous clusters are common in data centers. Data affinity means minimizing the movement of data across the network by allocating tasks close to the data they process. This thesis considers the problem of scheduling Spark tasks with data affinity in a heterogeneous cluster, with the objective of minimizing the makespan of a Spark application. There are two main challenges: (i) how to generate an appropriate topological stage order in a complex DAG composed of jobs and stages, and (ii) how to obtain an optimal tradeoff between data affinity and system load balance so as to minimize the makespan. Based on the characteristics of Spark applications, a task scheduling framework with data affinity is constructed, and an optimization algorithm based on heuristic rules is proposed. The algorithm consists of four components: stage sorting, task sequencing, resource allocation and schedule improvement. A strategy based on the estimated processing speeds of virtual machines is developed for calculating the temporal parameters of stages. To obtain a topological sequence of stages, four stage priority rules are introduced: Earliest Start Time First, Max Estimation Duration First, Min Float Time First and RNDM. The task scheduling sequence is generated by three task sequencing rules. To account for both data affinity and load balance, a virtual machine list is managed dynamically, and four virtual machine search strategies are presented: Highest Speed First, Earliest Availability Time First, Earliest Finish Time First and RNDM. A schedule improvement method that uses the idle time slots between tasks is designed to search for better solutions. To evaluate the proposed algorithm, multi-factor analysis of variance (ANOVA) is used to calibrate the algorithm parameters, and the best parameter combination is determined from means plots. The proposed algorithm is then compared with baseline algorithms from several aspects. Experimental results indicate that the proposed algorithm outperforms the baseline algorithms.
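
The tradeoff between data affinity and load balance in the Earliest Finish Time First strategy can be illustrated with a small sketch. This is not the thesis's algorithm: the `VM` and `Task` classes, the `finish_time` and `schedule_eft` functions, the single uniform `bandwidth` value and the rule that a task pays a transfer delay whenever it runs away from its data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical virtual machine with an estimated processing speed."""
    name: str
    speed: float               # work units processed per second
    available_at: float = 0.0  # time at which the VM becomes idle


@dataclass
class Task:
    """Hypothetical task whose input data resides on one VM."""
    name: str
    work: float         # work units to process
    data_size: float    # input size in MB
    data_location: str  # name of the VM that stores the input


def finish_time(task: Task, vm: VM, bandwidth: float) -> float:
    """Estimated finish time of `task` on `vm`.

    Assumption: running on the VM that holds the data costs no transfer
    time; any other VM first pulls the data at `bandwidth` MB/s.
    """
    transfer = 0.0 if vm.name == task.data_location else task.data_size / bandwidth
    return vm.available_at + transfer + task.work / vm.speed


def schedule_eft(tasks, vms, bandwidth=100.0):
    """Greedily place each task, in the given order, on the VM that
    finishes it earliest. Returns (assignment, makespan)."""
    assignment = {}
    for task in tasks:
        best = min(vms, key=lambda vm: finish_time(task, vm, bandwidth))
        # Occupy the chosen VM until the task finishes.
        best.available_at = finish_time(task, best, bandwidth)
        assignment[task.name] = best.name
    makespan = max(vm.available_at for vm in vms)
    return assignment, makespan
```

In this sketch a fast VM wins a task only when its speed advantage outweighs the transfer delay and its current queue; otherwise the task stays next to its data. The order of `tasks` stands in for the output of the stage sorting and task sequencing steps, which are not modeled here.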

Keywords/Search Tags:

Heterogeneous Cluster, Spark, Data Affinity, Task Scheduling

Related items

1	Research Of Task Scheduling Strategy For Heterogeneous Cluster In Spark Computing Environment
2	Research And Application Of Energy Efficiency Model And Task Scheduling Based On Heterogeneous Spark Cluster
3	A Research Of Straggler Strategy For Heterogeneous Spark Cluster
4	Scheduling Spark Tasks To Heterogeneous Cluster
5	Research On Spark Task Scheduling Technology Based On Execution Time Prediction
6	Research On Construction Of ETL Cluster Model Based On Task Scheduling
7	Design And Implementation Of A Heterogeneous Data Source Exchange System Based On Spark
8	The Elastic Resource Allocation And Task Scheduling Of Spark
9	Research And Implementation Of Heterogeneous Computing Cluster Scheduling System
10	Optimization Of Spark Task Scheduler For Shuffle Operators