
Research And Application Of GPU Scheduling Strategy And Task Parallelization Method On Deep Learning Cloud Platform

Posted on: 2021-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Geng
Full Text: PDF
GTID: 2518306308478544
Subject: Computer Science and Technology
Abstract/Summary:
With the widespread use of GPUs in large-scale deep learning scenarios, the efficient execution of multiple deep learning jobs in a GPU cluster has attracted great attention. A deep learning cloud platform integrates multiple GPU computing resources and can process large-scale deep learning tasks efficiently. However, the GPU scheduling strategy of Kubernetes-based deep learning cloud platforms treats the GPU as the smallest resource allocation unit and allocates whole GPUs to containers, which leaves GPU resources underutilized. At the same time, when multiple deep learning tasks execute together, an unreasonable task parallelization strategy causes resource contention between tasks. Moreover, for deep learning training tasks that require multiple GPUs, an unreasonable parallelization strategy increases the communication cost between tasks, which lengthens task execution time.

First, this paper analyzes the resource scheduling strategy of the Kubernetes container management platform and proposes a fine-grained GPU scheduling optimization strategy that jointly considers the GPU resource requirements of deep learning jobs and the GPU resource usage of each node in the cluster. The strategy avoids scheduling jobs with similar resource requirements onto the same node, thereby enabling balanced use of the multidimensional resources on the cloud platform's nodes.

Second, we propose an interference-aware performance prediction model that predicts how interference affects the performance of multiple deep learning tasks co-executing on a GPU. Based on this model, we propose an interference-aware and topology-aware deep learning task parallelization strategy. According to the characteristics of each deep learning task, the strategy schedules tasks onto appropriate GPUs, taking into account both the interference, measured as the slowdown of co-executing tasks, and the communication cost among tasks.

Finally, we use Docker container technology to build images for large-scale deep learning jobs and verify the proposed GPU scheduling optimization strategy and task parallelization method on a deep learning cloud platform. Experimental results show that the proposed strategy and method effectively improve both GPU utilization on the platform and the execution efficiency of deep learning jobs.
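The core scheduling idea summarized above (fine-grained GPU sharing, balanced multidimensional resource use, and avoiding co-location of jobs with similar resource profiles as an interference proxy) can be illustrated with a small sketch. This is a minimal hypothetical illustration, not the thesis's actual implementation: the `Job`/`Node` classes, the similarity measure, and the scoring weights are all assumptions invented here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    gpu_mem: float      # fraction of one GPU's memory requested (0..1)
    gpu_compute: float  # fraction of one GPU's compute requested (0..1)

@dataclass
class Node:
    name: str
    gpu_mem_used: float = 0.0
    gpu_compute_used: float = 0.0
    jobs: list = field(default_factory=list)

def similarity(a: Job, b: Job) -> float:
    """Crude profile similarity: 1.0 when two jobs stress the same
    resource dimensions equally, approaching 0 when demands differ."""
    return 1.0 - (abs(a.gpu_mem - b.gpu_mem)
                  + abs(a.gpu_compute - b.gpu_compute)) / 2.0

def score(node: Node, job: Job) -> float:
    # Reject nodes that cannot fit the fine-grained request at all.
    if (node.gpu_mem_used + job.gpu_mem > 1.0
            or node.gpu_compute_used + job.gpu_compute > 1.0):
        return float("-inf")
    # Prefer placements that keep memory and compute usage balanced.
    mem_after = node.gpu_mem_used + job.gpu_mem
    comp_after = node.gpu_compute_used + job.gpu_compute
    balance_penalty = abs(mem_after - comp_after)
    # Penalize co-locating jobs with similar resource profiles,
    # a stand-in for the interference the thesis predicts with its model.
    interference_penalty = max(
        (similarity(job, j) for j in node.jobs), default=0.0)
    return -(balance_penalty + interference_penalty)

def place(nodes: list, job: Job) -> str:
    """Pick the highest-scoring node and record the placement."""
    best = max(nodes, key=lambda n: score(n, job))
    best.jobs.append(job)
    best.gpu_mem_used += job.gpu_mem
    best.gpu_compute_used += job.gpu_compute
    return best.name
```

Under this scoring, two memory-heavy jobs that would each fit on one node are still spread across nodes, because the similarity penalty outweighs the balance penalty of an empty node; in the thesis, the penalty would instead come from the learned interference prediction model.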
Keywords/Search Tags: deep learning, cloud platform, resource scheduling, task parallelization