Spark Task Scheduling With Data Skew And Deadline Constraints

Posted on:2021-06-13

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Lu

GTID:2518306557489734

Subject:Software engineering

Abstract/Summary:

With the development and popularization of the Internet of Things, the mobile Internet and cloud computing, the volume of generated data has shown "exponential explosion" growth. To process large amounts of data efficiently, small businesses and individuals often need to build private Hadoop/Spark clusters. Doing so is not only operationally complicated but also costly, because a large number of machines must be purchased, so deploying Spark applications on a public cloud may be a better choice. Distributed parallel computing frameworks on cloud platforms still face many challenges, among which data skew is becoming one of the main bottlenecks to improving the performance of big data processing.

This thesis considers Spark task scheduling with data skew and deadline constraints, with the objective of minimizing the rental cost. The main challenges are: (i) data skew is a qualitative concept, so it must first be defined and quantified; (ii) how to resolve the mutual constraint between data skew and deadlines; (iii) how to obtain the optimal scheduling sequence in the complex DAG composed of jobs and stages.

The existing scheduling system architecture is modified to give a new architecture, and a mathematical model is established on top of it. A Spark task scheduling algorithm is then proposed that handles both data skew and deadline constraints. The algorithm consists of three parts: stage sequencing, task scheduling and scheduling adjustment. For stage sequencing, four priority rules are proposed: HSF (Heaviest Skew First), LSRF (Largest Skew Rate First), LDF (Largest Data First) and RAND (Random). Task scheduling includes four steps: task classification, virtual machine type selection, available resource searching and virtual machine instance selection. For virtual machine type selection, four strategies are proposed: STF (Skewed Task First), LTF (Largest Task First), SLTF (Skewed and Largest Task First) and RAND (Random). For virtual machine instance selection, three strategies are proposed: EATF (Earliest Available Time First), LRTIF (Longest Remaining Time Interval First) and RAND (Random). Task scheduling produces an initial solution. Scheduling adjustment then starts from this initial solution and applies FTSM (Fragmented Time Slice Merging) and ITSF (Idle Time Slice Filling) to reduce the rental cost.

To evaluate the proposed algorithm, multi-factor analysis of variance (ANOVA) is used to calibrate its parameters and components. Two related algorithms serve as baselines, and the performance differences between the proposed algorithm and the baselines are compared and analyzed from several aspects. Experimental results show that the proposed algorithm outperforms the compared algorithms.
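
The stage sequencing rules named in the abstract can be illustrated with a minimal Python sketch. The abstract gives only the rule names, so everything concrete here is an assumption made for illustration: the `Stage` class, the skew measures (`skew_amount` as the largest task size minus the mean, `skew_rate` as the largest task size divided by the mean) and the `sequence_stages` helper are hypothetical and are not the thesis's definitions. The sketch also ignores DAG precedence and deadlines, and only orders a set of stages that are assumed to be ready.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Stage:
    name: str
    task_sizes: list  # per-task input size, hypothetical units (e.g. MB)


def skew_amount(stage):
    """Assumed skew measure: largest task size minus the mean task size."""
    return max(stage.task_sizes) - mean(stage.task_sizes)


def skew_rate(stage):
    """Assumed skew rate: largest task size divided by the mean task size."""
    return max(stage.task_sizes) / mean(stage.task_sizes)


def total_data(stage):
    """Total input size of the stage."""
    return sum(stage.task_sizes)


def sequence_stages(stages, rule, seed=0):
    """Order ready stages by one of the four priority rules (highest first)."""
    if rule == "RAND":
        shuffled = list(stages)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    keys = {"HSF": skew_amount, "LSRF": skew_rate, "LDF": total_data}
    return sorted(stages, key=keys[rule], reverse=True)


stages = [
    Stage("s1", [10, 10, 10, 90]),      # large absolute skew
    Stage("s2", [1, 1, 1, 17]),         # small stage, high relative skew
    Stage("s3", [100, 100, 100, 100]),  # most data, no skew
]

for rule in ("HSF", "LSRF", "LDF"):
    print(rule, [s.name for s in sequence_stages(stages, rule)])
```

On this toy input the three deterministic rules produce three different orders, which shows why the choice of priority rule is a component worth calibrating.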

Keywords/Search Tags:

Data Skew, Spark, Scheduling Optimization, Cloud Computing

Related items

1	Research On Partition Loading Balance Based On Spark Data Skew
2	Research And Optimization Of Adaptive Techniques For Mitigating Skew In Spark
3	Research On Optimization Mechanism Of Containerized Spark Resource Scheduling In Cloud Environment
4	Research On Data Skew Optimization In Spark Computing Framework
5	Research Of Performance Optimization For Data Skew Based On High-speed Networks
6	The Elastic Resource Allocation And Task Scheduling Of Spark
7	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew
8	Research On Spark Data Skewing Improvement And Decision Tree Parallelization Application Under Cloud Edge Collaboration
9	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
10	Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment