
Spark Task Scheduling With Data Skew And Deadline Constraints

Posted on: 2021-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Lu
Full Text: PDF
GTID: 2518306557489734
Subject: Software engineering
Abstract/Summary:
With the development and popularization of the Internet of Things, the mobile Internet and cloud computing, the volume of generated data has grown exponentially. To process large amounts of data efficiently, small businesses and individuals often need to build private Hadoop/Spark clusters. Building such a cluster is complicated and also requires purchasing a large number of machines at high cost, so deploying Spark applications on a public cloud may be a better choice. Distributed parallel computing frameworks on cloud platforms still face many challenges, and data skew is becoming one of the main bottlenecks to improving the performance of big data processing.

This thesis considers Spark task scheduling with data skew and deadline constraints, with the objective of minimizing the rental cost. The main challenges are: (i) data skew is a qualitative concept, so it must first be defined and quantified; (ii) data skew and deadlines constrain each other, and this mutual constraint must be resolved; (iii) an optimal scheduling sequence must be found in a complex DAG composed of jobs and stages.

The existing scheduling system architecture is modified into a new architecture, and a mathematical model is established on top of it. A Spark task scheduling algorithm that handles both data skew and deadline constraints is then proposed. The algorithm consists of three parts: stage sequencing, task scheduling and scheduling adjustment. For stage sequencing, four priority rules are proposed: HSF (Heaviest Skew First), LSRF (Largest Skew Rate First), LDF (Largest Data First) and RAND (Random). Task scheduling includes four steps: task classification, virtual machine type selection, available resource searching and virtual machine instance selection. For virtual machine type selection, four strategies are proposed: STF (Skewed Task First), LTF (Largest Task First), SLTF (Skewed and Largest Task First) and RAND (Random). For virtual machine instance selection, three strategies are proposed: EATF (Earliest Available Time First), LRTIF (Longest Remaining Time Interval First) and RAND (Random). Task scheduling produces an initial solution. Scheduling adjustment starts from this initial solution and applies FTSM (Fragmented Time Slice Merging) and ITSF (Idle Time Slice Filling) to reduce the rental cost.

To evaluate the proposed algorithm, multi-factor analysis of variance (ANOVA) is used to calibrate its parameters and components. Two related algorithms serve as baselines, and the proposed algorithm is compared with them from several aspects. Experimental results show that the proposed algorithm outperforms both baselines.
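The abstract names the stage-sequencing rules but does not give their formulas. The sketch below is only an illustration of how such priority rules could be expressed, under assumptions that are not taken from the thesis: skew is measured per stage from hypothetical per-task partition sizes, "skew amount" is taken as the gap between the largest partition and the mean, and "skew rate" as the ratio of the largest partition to the mean. The `Stage` class, its fields and `order_stages` are hypothetical names, and DAG precedence between stages is ignored (only a set of ready stages is ordered).

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Stage:
    """Hypothetical stage record; not the thesis's data model."""
    name: str
    partition_sizes: List[float]  # assumed: input size of each task, e.g. in MB

    @property
    def total_data(self) -> float:
        return sum(self.partition_sizes)

    @property
    def skew_amount(self) -> float:
        # Assumed metric: how far the largest partition exceeds the mean.
        return max(self.partition_sizes) - mean(self.partition_sizes)

    @property
    def skew_rate(self) -> float:
        # Assumed metric: largest partition relative to the mean (1.0 = no skew).
        return max(self.partition_sizes) / mean(self.partition_sizes)


# Each rule maps a stage to a priority value; larger values are scheduled first.
# The mapping of rule names to metrics is an assumption based on the rule names.
PRIORITY_RULES = {
    "HSF": lambda s: s.skew_amount,   # Heaviest Skew First
    "LSRF": lambda s: s.skew_rate,    # Largest Skew Rate First
    "LDF": lambda s: s.total_data,    # Largest Data First
}


def order_stages(stages: List[Stage], rule: str) -> List[Stage]:
    """Sort ready stages by the chosen priority rule, highest priority first."""
    return sorted(stages, key=PRIORITY_RULES[rule], reverse=True)


stages = [
    Stage("A", [10, 10, 10, 90]),
    Stage("B", [100, 100, 100, 140]),
    Stage("C", [1, 1, 1, 17]),
]

for rule in PRIORITY_RULES:
    print(rule, [s.name for s in order_stages(stages, rule)])
```

With these made-up inputs the three rules produce three different orders, which is the reason a calibration step such as the ANOVA mentioned above is needed to choose among them.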
Keywords/Search Tags:Data Skew, Spark, Scheduling Optimization, Cloud Computing