
The Elastic Resource Allocation And Task Scheduling Of Spark

Posted on: 2018-01-16 | Degree: Master | Type: Thesis
Country: China | Candidate: D Chen | Full Text: PDF
GTID: 2428330590477761 | Subject: Software engineering
Abstract/Summary:
With the booming development of the Internet, petabytes of data are produced every day. These data are important and valuable, and people increasingly focus on mining them for deeper insight. However, it is hard to compute and analyze such large volumes of data on a single machine, given the limits of hardware performance. Parallel computing on a distributed system is a good solution: a large data set can be split into tasks that run in parallel on different machines. Spark is one of the best-known parallel computing platforms, and it improves the performance of parallel computing through in-memory computing and persistence. However, Spark and other parallel platforms still share some common problems, such as skew and memory tuning. Skew is the phenomenon in which some machines in the cluster perform worse than others, or some tasks in a job run slower than others, because of imbalanced data allocation, imbalanced computing resources, and so on. These performance differences can degrade the whole parallel computation.

This paper introduces an elastic task allocation strategy based on tiny tasks, which aims to reduce the impact of data skew in Spark. In addition, the paper introduces a new task scheduling mechanism based on elastic task granularity and task difficulty, with reliability and fault tolerance. Since the appropriate task granularity differs across applications, stages, and nodes, elastic task granularity tunes the granularity according to the running status of the cluster. With elastic granularity, it is easier to balance load across nodes and to reduce the impact of skew in Spark. Since both task locality and failure count affect task performance, the paper introduces task difficulty, which combines task locality and failure count. The scheduling mechanism based on task granularity and difficulty translates task difficulty into task granularity and performs fast task scheduling according to that granularity. This mechanism implements a fast and efficient scheduling method, and it also provides good fault tolerance and reliability by combining Spark's fault-tolerance mechanism with elastic tasks.

This paper implements the resource allocation and task scheduling mechanism by rewriting some modules in the Spark source code. It reuses the core Spark scheduling and task management code in order to inherit the fault tolerance of the original Spark. On this basis, the paper implements the elastic task model, dynamic granularity computation, and the task scheduling mechanism based on task granularity and difficulty.

The paper also presents experiments that verify the system, covering the effect of tiny tasks on reducing skew, performance and resource-usage improvements compared with Spark, and fault-tolerance verification. The results show that the system greatly reduces the impact of data skew in Spark and achieves substantial performance and resource-usage improvements over the original Spark.
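To make the difficulty-to-granularity idea concrete, the following is a minimal sketch in plain Scala (Spark's implementation language), with no Spark dependency. All names here (TaskInfo, difficulty, granularityFor) and the exact weights and formulas are illustrative assumptions, not identifiers or formulas from Spark or from this thesis's implementation. The sketch only captures the stated intuition: difficulty grows with worse locality and more past failures, and harder tasks receive smaller input slices and are scheduled first.

    // Hypothetical sketch of difficulty-driven elastic granularity.
    // Names and weights are assumptions for illustration only.
    object ElasticSchedulingSketch {

      // Locality levels ordered from best (process-local) to worst (no preference).
      sealed trait Locality { def penalty: Double }
      case object ProcessLocal extends Locality { val penalty = 0.0 }
      case object NodeLocal    extends Locality { val penalty = 0.3 }
      case object RackLocal    extends Locality { val penalty = 0.6 }
      case object AnyLocality  extends Locality { val penalty = 1.0 }

      final case class TaskInfo(id: Int, locality: Locality, failures: Int)

      // Task difficulty combines locality and failure count: poorly located or
      // frequently failing tasks are "harder".
      def difficulty(t: TaskInfo, failureWeight: Double = 0.5): Double =
        t.locality.penalty + failureWeight * t.failures

      // Map difficulty to a granularity: the harder the task, the smaller the
      // slice of input it receives, so stragglers shrink instead of dominating
      // the stage's completion time. Floor at 1/8 of the base slice.
      def granularityFor(d: Double, baseRecords: Long = 1000000L): Long = {
        val divisor = math.max(1L, math.round(1.0 + d))
        math.max(baseRecords / divisor, baseRecords / 8)
      }

      def main(args: Array[String]): Unit = {
        val tasks = Seq(
          TaskInfo(0, ProcessLocal, failures = 0),
          TaskInfo(1, RackLocal,    failures = 1),
          TaskInfo(2, AnyLocality,  failures = 3)
        )
        // Schedule hardest tasks first, each with an elastically chosen slice size.
        tasks.sortBy(t => -difficulty(t)).foreach { t =>
          val d = difficulty(t)
          println(f"task ${t.id}: difficulty=$d%.2f, slice=${granularityFor(d)} records")
        }
      }
    }

Sorting by descending difficulty so that the hardest tasks start earliest is one simple policy consistent with the abstract; the thesis's actual mechanism additionally reuses Spark's own scheduler and fault-tolerance machinery, which this standalone sketch does not model.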
Keywords/Search Tags:Spark, Parallel Computing, Data Skew, Task Scheduling