
Task Scheduling For Spark Application With Data Affinity In Heterogeneous Cluster

Posted on: 2021-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: H A Du
Full Text: PDF
GTID: 2518306557489634
Subject: Software engineering
Abstract/Summary:
The task scheduling strategies currently provided by Spark assume homogeneous environments. However, heterogeneous clusters are common in data centers. Data affinity means minimizing the movement of data across the network by allocating tasks close to their data. This thesis considers the problem of scheduling Spark tasks with data affinity in a heterogeneous cluster, with the objective of minimizing the makespan of a Spark application. There are two main challenges: (i) how to generate an appropriate topological stage order in a complex DAG composed of jobs and stages; (ii) how to obtain an optimal tradeoff between data affinity and system load balance in order to minimize the makespan. Based on the characteristics of Spark applications, a task scheduling framework with data affinity is constructed, and an optimization algorithm based on heuristic rules is proposed. The algorithm consists of four components: stage sorting, task sequencing, resource allocation and schedule improvement. A strategy is developed for calculating the temporal parameters of stages from the estimated processing speeds of virtual machines. To obtain a topological sequence of stages, four stage priority rules are introduced: Earliest Start Time First, Max Estimation Duration First, Min Float Time First and RNDM. The task scheduling sequence is generated by three task sequencing rules. Considering both data affinity and load balance, a virtual machine list is managed dynamically, and four virtual machine searching strategies are presented: Highest Speed First, Earliest Availability Time First, Earliest Finish Time First and RNDM. A schedule improvement method that uses idle time slots between tasks is designed to search for better solutions. To verify the performance of the proposed algorithm, multi-factor analysis of variance (ANOVA) is used to calibrate the algorithm parameters, and the best parameter combination is determined from means plots. The proposed algorithm is compared with baseline algorithms from several aspects. Experimental results indicate that the proposed algorithm outperforms the baseline algorithms.
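
As an illustration only, two of the ideas named in the abstract can be sketched in Python: ordering stages topologically under a priority rule, and choosing a virtual machine by estimated finish time while accounting for data affinity. This is not the thesis's algorithm. The names `VM`, `stage_order`, `pick_vm_earliest_finish`, the `transfer_time` penalty and the numeric priorities are all hypothetical, and the cost model (a fixed transfer delay when the input data is not local) is an assumption made for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    speed: float               # assumed unit: work units per second
    available_at: float = 0.0  # time at which the VM becomes free


def stage_order(deps, priority):
    """Topological order (Kahn's algorithm) with a priority tie-break.

    deps maps each stage to the set of stages it depends on.
    priority maps each stage to a number; among ready stages the
    smallest value goes first (e.g. an estimated earliest start time).
    """
    remaining = {s: set(d) for s, d in deps.items()}
    ready = [(priority[s], s) for s, d in remaining.items() if not d]
    heapq.heapify(ready)
    order = []
    while ready:
        _, stage = heapq.heappop(ready)
        order.append(stage)
        for other, d in remaining.items():
            if stage in d:
                d.remove(stage)
                if not d:  # all predecessors are now scheduled
                    heapq.heappush(ready, (priority[other], other))
    if len(order) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return order


def pick_vm_earliest_finish(vms, work, data_vm, transfer_time):
    """Return (vm, finish_time) with the smallest estimated finish time.

    A VM that does not hold the task's input pays transfer_time first
    (an assumed cost model), so a slow local VM can still beat a fast
    remote one, or lose to it when the transfer is cheap.
    """
    def finish(vm):
        penalty = 0.0 if vm.name == data_vm else transfer_time
        return vm.available_at + penalty + work / vm.speed

    best = min(vms, key=finish)
    return best, finish(best)


# Hypothetical example data: two source stages feeding a third.
deps = {"s0": set(), "s1": set(), "s2": {"s0", "s1"}}
priority = {"s0": 2.0, "s1": 1.0, "s2": 3.0}
order = stage_order(deps, priority)

vms = [VM("fast", speed=4.0, available_at=3.0), VM("local", speed=1.0)]
cheap = pick_vm_earliest_finish(vms, work=8.0, data_vm="local", transfer_time=2.0)
costly = pick_vm_earliest_finish(vms, work=8.0, data_vm="local", transfer_time=4.0)
```

In this toy setup the choice flips between the fast remote machine and the slow local one as the assumed transfer cost grows, which is the tension between data affinity and load balance that the abstract describes.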
Keywords/Search Tags: Heterogeneous Cluster, Spark, Data Affinity, Task Scheduling