
Research On Deep Learning Task Scheduling Based On Small Scale GPU Cluster Platform

Posted on: 2020-05-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Chen
Full Text: PDF
GTID: 1488306548491664
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the core technology represented by deep learning (DL) has triggered the third wave of artificial intelligence (AI). From Internet giants to small and medium-sized enterprises, and from research institutes to universities, both academia and industry have focused on the research and exploration of DL techniques. Although dedicated DL hardware such as the TPU is emerging, GPU clusters still dominate DL research and development (R&D). Compared with the large-scale customized DL platforms launched by giant AI companies, most research institutes and small and medium-sized enterprises prefer, due to limited budgets, to adopt a cost-effective small-scale GPU cluster as a shared multi-user DL R&D platform. Against this background, how to improve the resource utilization of the GPU platform and the throughput of DL tasks is a highly practical research direction. To address these challenges, this paper focuses on small-scale GPU clusters in the DL R&D scenario and, based on the evaluation and analysis of DL tasks, proposes a series of scheduling strategies to improve task processing efficiency. The primary contributions and innovations of this paper are as follows:

1. Since a DL R&D platform deals only with DL tasks, we evaluate and analyze DL tasks before designing the scheduling algorithms. This paper summarizes the features of DL tasks in terms of network structure, computation flow, communication mode, implementation framework, application hyper-parameters, and distributed-training parameters. Moreover, on a small-scale GPU cluster, this paper evaluates typical DL networks with respect to task throughput, GPU resource utilization, memory usage, GPU scalability, and GPU locality, summarizing the characteristics of DL tasks. These analyses and conclusions serve as an important basis for the design of the subsequent scheduling algorithms.

2. Based on the evaluation and analysis of DL tasks, we propose GENIE, a QoS-aware dynamic scheduling framework. The framework mainly includes an offline profiling module and an online scheduling module. GENIE analyzes the characteristics of DL tasks and builds a performance prediction model from the profiling results of a lightweight offline profiler. Based on this prediction model, GENIE dynamically selects the best placement for each task online and schedules it on the GPU cluster. Experiments on a 16-GPU cluster and a simulator demonstrate that GENIE achieves better QoS guarantees and higher resource utilization than baseline scheduling algorithms.

3. Considering that prior prediction-based schedulers are limited by their prediction accuracy and offline-profiling overhead, an online reinforcement learning (RL)-based scheduling strategy is proposed in this paper. The RL-based strategy adopts the Q-learning framework to model the R&D scenario and designs the corresponding state space, action space, reward function, and update scheme. The learning agent learns independently and continuously from task-performance feedback to adjust its online scheduling decisions. Experiments on GPU clusters demonstrate that the RL-based scheduler significantly improves the average normalized task throughput and the makespan. Moreover, the proposed RL-based scheduler is better suited to long-term DL R&D scenarios.

4. To improve GPU utilization under exclusive task-scheduling strategies, a GPU-sharing scheduling strategy based on memory efficiency is proposed. The scheduler exploits the network-structure information of a DL task to calculate its computation and GPU memory usage under different placements. We adopt memory efficiency, measured as the computational scale per unit of GPU memory occupied, as an indicator to evaluate candidate placements for DL tasks. Based on memory efficiency, a heuristic scheduling algorithm is proposed to realize multi-task sharing of GPU resources and to further improve system resource utilization and the task completion rate.
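The Q-learning formulation summarized in contribution 3 can be sketched in a few lines. The snippet below is an illustrative toy only, not the dissertation's implementation: the state encoding, the candidate-placement action set, and the scalar reward here are all assumptions standing in for the richer state space, action space, and reward function described above.

```python
import random
from collections import defaultdict

class QSchedulerSketch:
    """Toy tabular Q-learning agent for DL task placement.

    States, actions, and rewards are illustrative stand-ins; the
    dissertation's actual design is more elaborate.
    """

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # Q[(state, action)] -> estimated value
        self.actions = actions        # e.g. candidate GPU placements
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration probability

    def choose(self, state):
        # Epsilon-greedy selection over candidate placements.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update driven by observed task-performance
        # feedback (e.g. normalized throughput of the scheduled task).
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

Feeding each completed task's measured performance back through `update` is what lets such an agent refine its placement decisions online, without the offline-profiling pass that prediction-based schedulers like GENIE require.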
Keywords/Search Tags:GPU Cluster, Task Scheduling, Deep Learning, Research and Development Platform, QoS Scheduling, Reinforcement Learning