Efficient Resource Allocation Technology For Partially Predictable Deep Learning Training Jobs

Posted on: 2022-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: F S Li
Full Text: PDF
GTID: 2558307154975009
Subject: Engineering
Abstract/Summary:
As the scale and complexity of distributed deep learning models continue to grow, model training is becoming increasingly expensive. Large-scale training can require hundreds of servers running for tens of days, so in practice only a few large companies can afford to develop large-scale training jobs. Even with such expenditure, these companies do not obtain commensurate returns, because training efficiency is low and the usage of training resources across the cluster is imbalanced. To address these problems, this thesis identifies two features that can further improve training efficiency and resource utilization: partially predictable training and unified CPU and GPU training. Based on these two features, the thesis designs a new resource scheduler, AITurbo, which divides distributed training jobs into predictable jobs and unpredictable jobs. The two classes of jobs follow different scheduling strategies, while heterogeneous CPU and GPU resources are allocated in a unified manner. For predictable jobs, AITurbo builds a predictive model to estimate job performance under different heterogeneous resource configurations. For unpredictable jobs, AITurbo prioritizes jobs according to the amount of service each job has already received. To schedule predictable and unpredictable jobs in a unified manner, AITurbo uses a Borda-count-based multi-level feedback queue method to compute job priorities uniformly. The scheduler is implemented on the container management framework Kubernetes, supports multiple distributed deep learning models and multiple types of datasets, and provides an integrated workflow covering job deployment, operation, maintenance, and destruction. Experimental results show that, compared with the state-of-the-art schedulers Optimus and Tiresias, AITurbo improves resource utilization by at least 15% and reduces average job completion time by more than 2×, while the overall overhead of the scheduling process accounts for only 2.23% of job training time.
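
The abstract does not spell out how the Borda count and the multi-level feedback queue are combined, so the following is only a minimal Python sketch of the general idea, not the thesis's implementation. It assumes that each scheduling criterion (for example, a hypothetical ranking by predicted remaining time and a hypothetical ranking by service already received) produces an ordered list of job IDs, merges those lists with a standard Borda count, and then splits the jobs evenly into queue levels by score. The function names `borda_scores` and `assign_queues`, the even split into levels, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Merge several rankings with a standard Borda count.

    Each ranking lists job IDs from highest to lowest priority.
    A job at position i in a ranking of length n earns n - 1 - i points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, job in enumerate(ranking):
            scores[job] += n - 1 - position
    return dict(scores)


def assign_queues(scores, num_queues):
    """Map each job to a queue level; level 0 has the highest priority.

    Jobs are sorted by descending Borda score (ties broken by job ID)
    and split as evenly as possible across num_queues levels.
    This even split is an illustrative assumption.
    """
    ordered = sorted(scores, key=lambda job: (-scores[job], job))
    per_queue = -(-len(ordered) // num_queues)  # ceiling division
    return {job: index // per_queue for index, job in enumerate(ordered)}


# Hypothetical rankings produced by three different criteria.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "b", "c"],
]
scores = borda_scores(rankings)
queues = assign_queues(scores, num_queues=2)
```

Because the Borda count only needs rank positions, criteria measured in different units (predicted time versus received service) can be merged without normalizing their raw values, which is one plausible reason to use it for unifying the two job classes.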
Keywords/Search Tags: Distributed deep learning, Partial predictability training, Heterogeneous resources