
Research On Spark Task Scheduling Technology Based On Execution Time Prediction

Posted on: 2018-02-27  Degree: Master  Type: Thesis
Country: China  Candidate: S Y Liu  Full Text: PDF
GTID: 2348330563952538  Subject: Computer technology
Abstract/Summary:
Spark is a state-of-the-art big data processing platform that uses directed acyclic graphs (DAGs) to divide jobs into multiple execution stages according to the underlying data processing logic. Stages with data dependencies between them execute serially, while within each stage the data are processed by several tasks in parallel. Intermediate data tasks in Spark are the tasks that run in the intermediate and final execution stages of a Spark job; their common characteristic is that the data they process are generated by the previous stage and distributed across the computing nodes.

The current Spark platform schedules intermediate data tasks with the Delay Scheduling strategy. This strategy aims to maximize the data locality of each task, that is, to schedule every task on a node that holds a relatively high proportion of the data it needs to process. When a task has waited longer than a given time threshold, it is rescheduled to another node. Delay Scheduling, however, is only a local optimization over the currently idle nodes. Because it has no awareness of task execution times, it can hardly reach a globally optimal scheduling decision for the entire task set.

To address these problems, this paper proposes an intermediate data task scheduling strategy for Spark based on execution time prediction. The core idea is to estimate, before a task runs, its execution time on each computing node from the available computing resources and the data distribution among the relevant nodes in the cluster. With these estimates the strategy can predict when each computing node will become idle, which provides the basis for globally optimal scheduling of the task set. The main contributions of this paper are as follows:

1) Defining a task processing time evaluation model based on a multi-stage pipeline. An intermediate data task on the Spark platform can be divided into five stages: data fetching, data aggregation, data merging, data computation, and data storage. The first four stages are abstracted pairwise into two pipeline models. Based on this task processing model and the principle of the pipeline, the execution time evaluation method combines the five stages into a concrete quantitative calculation, so the estimate is well grounded and precisely specified.

2) Proposing a long-task-first scheduling strategy. Exploiting the parallel execution characteristics of Spark tasks, this strategy sorts all tasks in descending order of their estimated execution times and schedules the longer tasks first; shorter tasks are postponed so that they fill the gaps between executions on each computing node (illustrated in the sketch below). This balances the task execution time across nodes and shortens the overall execution time.

3) Building on the above results, designing and implementing a prototype of the Spark task scheduling technology based on task execution time evaluation. The prototype integrates the task execution time evaluation and pre-allocation techniques described above, and its performance is tested with typical benchmarks from BigDataBench. The results show that, compared with Spark's original Delay Scheduling strategy, the stage processing time is shortened by up to 25.1% per stage and by 13.2% on average.
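The long-task-first idea in contribution 2) is essentially a longest-task-first assignment driven by the predicted execution times. Below is a minimal sketch of that idea in Scala; it is an illustration only, not the thesis's actual implementation or a Spark scheduler API, and the names `Task`, `NodeState`, and `scheduleLongTaskFirst` are hypothetical. The per-task time estimates are assumed to come from the pipeline-based evaluation model of contribution 1).

```scala
object LongTaskFirstDemo extends App {
  // Hypothetical types for the sketch: a task with its estimated execution
  // time (assumed to come from the pipeline-based evaluation model) and a
  // node with the predicted time at which it next becomes idle.
  case class Task(id: Int, estimatedTime: Double)
  case class NodeState(id: Int, var idleAt: Double)

  // Long-task-first assignment: sort tasks by estimated time in descending
  // order and always place the next task on the node predicted to be idle
  // earliest, so the shorter tasks scheduled later fill the remaining gaps.
  def scheduleLongTaskFirst(tasks: Seq[Task], nodes: Seq[NodeState]): Map[Int, Int] = {
    val ordered = tasks.sortBy(t => -t.estimatedTime)
    ordered.map { task =>
      val node = nodes.minBy(_.idleAt)   // node predicted to be idle earliest
      node.idleAt += task.estimatedTime  // advance its predicted idle time
      task.id -> node.id                 // record task -> node assignment
    }.toMap
  }

  // Toy example with synthetic estimates and two nodes.
  val tasks = Seq(Task(1, 9.0), Task(2, 4.0), Task(3, 7.0), Task(4, 2.0))
  val nodes = Seq(NodeState(0, 0.0), NodeState(1, 0.0))
  println(scheduleLongTaskFirst(tasks, nodes)) // e.g. Map(1 -> 0, 3 -> 1, 2 -> 1, 4 -> 0)
}
```

Data locality is deliberately omitted from this toy version; in the thesis's strategy, the data distribution across nodes already enters the per-node execution time estimates produced by the pipeline model.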
Keywords/Search Tags:Big Data, Spark, intermediate data task, task scheduling, data locality