
Research And Application Of GPU Scheduling Strategy And Task Parallelization Method On Deep Learning Cloud Platform

Posted on: 2021-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Geng
Full Text: PDF
GTID: 2518306308478544
Subject: Computer Science and Technology
Abstract/Summary:
With the widespread use of GPUs in large-scale deep learning scenarios, the efficient execution of multiple deep learning jobs in a GPU cluster has attracted great attention. A deep learning cloud platform integrates multiple GPU computing resources and can process large-scale deep learning tasks efficiently. However, the GPU scheduling strategy of Kubernetes-based deep learning cloud platforms treats the GPU as the smallest resource allocation unit and allocates whole GPUs to containers, which leaves GPU resources underutilized. At the same time, when multiple deep learning tasks execute together, an unreasonable task parallelization strategy causes resource contention between tasks. Moreover, for deep learning training tasks that require multiple GPUs, an unreasonable parallelization strategy increases the communication cost between tasks, which lengthens task execution time.

First, this paper analyzes the resource scheduling strategy of the Kubernetes container management platform and proposes a fine-grained GPU scheduling optimization strategy that jointly considers the GPU resource requirements of deep learning jobs and the GPU resource usage of each node in the cluster. The strategy avoids scheduling jobs with similar resource requirements onto the same node, thereby enabling balanced use of the multidimensional resources on the cloud platform's nodes.

Second, we propose an interference-aware performance prediction model that predicts how interference affects the performance of multiple deep learning tasks co-executing on a GPU. Based on this model, we propose an interference-aware and topology-aware deep learning task parallelization strategy. According to the characteristics of each deep learning task, the strategy schedules tasks onto appropriate GPUs, taking into account both the interference, measured as the slowdown of co-executing tasks, and the communication cost among tasks.

Finally, we use Docker container technology to build images for large-scale deep learning jobs and verify the proposed GPU scheduling optimization strategy and task parallelization method on a deep learning cloud platform. Experimental results show that the proposed strategy and method effectively improve both GPU utilization on the platform and the execution efficiency of deep learning jobs.
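The core scheduling idea summarized above (fine-grained GPU sharing, balanced multidimensional resource use, and avoiding co-location of jobs with similar resource profiles as an interference proxy) can be illustrated with a small sketch. This is a minimal hypothetical illustration, not the thesis's actual implementation: the `Job`/`Node` classes, the similarity measure, and the scoring weights are all assumptions invented here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    gpu_mem: float      # fraction of one GPU's memory requested (0..1)
    gpu_compute: float  # fraction of one GPU's compute requested (0..1)

@dataclass
class Node:
    name: str
    gpu_mem_used: float = 0.0
    gpu_compute_used: float = 0.0
    jobs: list = field(default_factory=list)

def similarity(a: Job, b: Job) -> float:
    """Crude profile similarity: 1.0 when two jobs stress the same
    resource dimensions equally, approaching 0 when demands differ."""
    return 1.0 - (abs(a.gpu_mem - b.gpu_mem)
                  + abs(a.gpu_compute - b.gpu_compute)) / 2.0

def score(node: Node, job: Job) -> float:
    # Reject nodes that cannot fit the fine-grained request at all.
    if (node.gpu_mem_used + job.gpu_mem > 1.0
            or node.gpu_compute_used + job.gpu_compute > 1.0):
        return float("-inf")
    # Prefer placements that keep memory and compute usage balanced.
    mem_after = node.gpu_mem_used + job.gpu_mem
    comp_after = node.gpu_compute_used + job.gpu_compute
    balance_penalty = abs(mem_after - comp_after)
    # Penalize co-locating jobs with similar resource profiles,
    # a stand-in for the interference the thesis predicts with its model.
    interference_penalty = max(
        (similarity(job, j) for j in node.jobs), default=0.0)
    return -(balance_penalty + interference_penalty)

def place(nodes: list, job: Job) -> str:
    """Pick the highest-scoring node and record the placement."""
    best = max(nodes, key=lambda n: score(n, job))
    best.jobs.append(job)
    best.gpu_mem_used += job.gpu_mem
    best.gpu_compute_used += job.gpu_compute
    return best.name
```

Under this scoring, two memory-heavy jobs that would each fit on one node are still spread across nodes, because the similarity penalty outweighs the balance penalty of an empty node; in the thesis, the penalty would instead come from the learned interference prediction model.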
Keywords/Search Tags: deep learning, cloud platform, resource scheduling, task parallelization