The Research On Spark Task Scheduling Strategy Based On Dynamic Memory Awareness

Posted on: 2021-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: A L Zeng
Full Text: PDF
GTID: 2518306122968629
Subject: Computer Science and Technology
Abstract/Summary:
In today's big data era, large-scale data processing relies mainly on distributed parallel processing, and scheduling plays an important role in the performance of big data parallel processing frameworks. Spark is a recent advance in the field of big data processing. It is an in-memory parallel computing framework that uses a multi-threaded task scheduling model. Spark's task scheduling does not consider memory resources, and the number of task threads that can execute concurrently in an executor is determined by user-set parameters. This potentially limits the execution performance of both the task threads and the application.

To overcome this limitation, this paper proposes a dynamic memory-aware task scheduling strategy in Spark (DMATS). DMATS takes memory resources into consideration and adjusts task concurrency through static and dynamic methods without violating the data-locality principle of task scheduling, so that the number of concurrently executing tasks always suits the executor's computing resources. Specifically, the main contributions of this paper are:

1) A task data statistics method that calculates the amount of data a task needs to process, which determines the task's resource requirements. The method analyzes the RDD-based Spark execution engine and looks up existing relevant information to calculate the amount of data each task processes. It incurs little additional computing overhead and communication delay while producing accurate results.

2) An adaptive algorithm for calculating the initial task concurrency in the executor, which determines the number of task threads that can be scheduled at the start. Without degrading the performance of the existing scheduling mechanism, the algorithm considers the task resource requirements and the execution memory available in the executor to obtain an initial task concurrency adapted to memory resources.

3) A dynamic task concurrency adjustment algorithm, which adjusts the number of concurrent tasks based on feedback about the memory usage of previously completed tasks. With dynamic adjustment, scheduling better matches the memory demand of running tasks, which improves both resource utilization and the overall performance of the Spark platform.

4) A system implementing the dynamic memory-aware task scheduling strategy on the open-source Spark platform, bringing together the results above. The system applies the proposed task data statistics method, the initial static concurrency algorithm, and the subsequent dynamic concurrency adjustment algorithm.

Two typical types of workloads are selected to test the performance and resource usage of the task scheduling strategy. The results show that, compared with the native Spark scheduling strategy, application execution time is shortened by up to 43.64% and by 27.8% on average. CPU and memory resource utilization also improve significantly, by an average of 5.7% and 12.3% respectively. Compared with other Spark-based task scheduling strategies, the improvement is on average 10.6% higher.
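The abstract does not give the formulas behind contributions 2) and 3), so the following Python sketch is only an illustration of the general idea, not the thesis's algorithm. It assumes that the initial concurrency is the number of tasks whose estimated memory fits in the executor's available execution memory, capped by a user-set slot limit, and that the dynamic step re-estimates per-task memory from the peak usage of completed tasks. The names `ExecutorState`, `max_slots`, `initial_concurrency`, and `adjust_concurrency` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3


@dataclass
class ExecutorState:
    # Hypothetical model of an executor; both fields are assumptions.
    max_slots: int          # user-configured upper bound on concurrent tasks
    execution_memory: int   # execution memory currently available, in bytes


def initial_concurrency(executor: ExecutorState, est_task_memory: int) -> int:
    """Static step (assumed rule): how many tasks fit in memory, capped by
    the user-set slot limit and never below one task."""
    if est_task_memory <= 0:
        # No usable estimate: fall back to the user-configured limit.
        return executor.max_slots
    fit = executor.execution_memory // est_task_memory
    return max(1, min(executor.max_slots, fit))


def adjust_concurrency(current: int, executor: ExecutorState,
                       observed_peaks: List[int]) -> int:
    """Dynamic step (assumed rule): recompute concurrency using the largest
    peak memory observed among completed tasks as the new estimate."""
    if not observed_peaks:
        # No feedback yet: keep the current concurrency.
        return current
    return initial_concurrency(executor, max(observed_peaks))


if __name__ == "__main__":
    executor = ExecutorState(max_slots=8, execution_memory=4 * GIB)
    start = initial_concurrency(executor, 1 * GIB)
    print(start)  # 4
    # Completed tasks turned out to need up to 2 GiB each.
    print(adjust_concurrency(start, executor, [2 * GIB, 1 * GIB]))  # 2
```

Using the maximum observed peak is a conservative choice that favors avoiding memory pressure over maximizing parallelism; an average or a percentile would be an equally plausible assumption.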
Keywords/Search Tags: Big Data, Spark, Task Scheduling, Concurrency, Memory Resources