| With the rapid development of deep learning, the resource capacity and computing performance of a single node can no longer accommodate the training of large-scale deep learning models. Academia and industry have therefore proposed distributed deep learning (DDL) deployment to utilize the GPU resources of a cluster. However, distributed deployment poses new challenges for cloud computing service providers: there are still no reliable solutions for two key technical problems, workload balancing and interference-aware job scheduling, and these have become the bottleneck limiting further optimization of distributed deep learning. Against this background, this paper designs a performance-aware scheduling model for distributed deep learning jobs, which improves the cluster’s resource utilization and reduces the average job completion time through an adaptive dynamic balancing strategy for jobs and an execution-time-aware scheduling policy. The main contributions of this paper are as follows: (1) To address the performance skew that arises among different distributed deep learning jobs in a shared GPU cluster, this paper designs a dynamic load balancing strategy. For performance skew in data parallel jobs, a batch dynamic balance algorithm is proposed: based on task performance monitoring, it adjusts the training batch load in real time to achieve dynamic load balancing between data parallel nodes. For performance skew in model parallel jobs, a sub-model balance algorithm transfers network layers between worker nodes to restore the model to a load-balanced state. Compared with no load balancing strategy, the training efficiency of data parallel jobs and model parallel jobs increases by 48.61% and 35.37%, respectively. (2) Building on the dynamic balance algorithms that operate while distributed deep learning jobs are running, this paper designs a scheduling scheme for distributed deep learning jobs. To determine the resource occupancy of training loads, a random forest algorithm is used to predict the peak GPU memory of jobs, which reduces OOM errors by 82.11% in the experimental evaluation. The proposed execution time prediction model accurately predicts the execution time of a specific model on a node from extracted resource time series features and job configuration features; in experimental comparison, it reduces the prediction root mean square error by an average of 38.31%. Based on awareness of job GPU memory occupancy and execution time, a preemptive high-response-ratio queue is designed. The response ratio calculation optimizes the queue waiting time of jobs, while the preemption mechanism avoids head-of-line blocking caused by scheduling large models and makes full use of cluster resources. Integrating the above work, this paper implements a scheduling model for DDL jobs. In the evaluation, compared with the other strategies tested, the proposed strategy reduces the average job completion time by 61.46%, improves the average GPU utilization by 34.86%, and improves the GPU saturation index by 29.25%. These results indicate that the proposed solution enhances the adaptive ability of distributed jobs in the cluster, effectively improves the resource utilization efficiency of GPU clusters, and has strong practical application value. |
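
The response-ratio queue summarized above can be illustrated with a minimal sketch. It assumes the classic highest-response-ratio-next formula, (waiting time + service time) / service time, with the predicted execution time standing in for service time; the abstract does not state the exact formula the paper uses. The `Job` fields, `response_ratio`, and `pick_next` are hypothetical names introduced only for this illustration, and the preemption mechanism is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    submit_time: float         # seconds, when the job entered the queue
    predicted_runtime: float   # seconds, assumed > 0 (e.g. from a runtime predictor)
    predicted_peak_mem: float  # GiB (e.g. from a peak-memory predictor)


def response_ratio(job: Job, now: float) -> float:
    """Assumed classic ratio: (waiting time + service time) / service time."""
    waiting = max(0.0, now - job.submit_time)
    return (waiting + job.predicted_runtime) / job.predicted_runtime


def pick_next(queue: List[Job], now: float, free_mem: float) -> Optional[Job]:
    """Return the highest-ratio job whose predicted peak memory fits, else None."""
    fitting = [j for j in queue if j.predicted_peak_mem <= free_mem]
    if not fitting:
        return None
    return max(fitting, key=lambda j: response_ratio(j, now))


# Hypothetical queue: the large job arrived first but does not fit in 16 GiB.
queue = [
    Job("large", submit_time=0.0, predicted_runtime=3600.0, predicted_peak_mem=30.0),
    Job("small", submit_time=50.0, predicted_runtime=100.0, predicted_peak_mem=4.0),
    Job("medium", submit_time=20.0, predicted_runtime=600.0, predicted_peak_mem=10.0),
]
chosen = pick_next(queue, now=100.0, free_mem=16.0)
print(chosen.name if chosen else "nothing fits")
```

Filtering by predicted peak memory before ranking lets smaller jobs run while a large job waits for resources, which is one simple way to avoid head-of-line blocking. Because the ratio grows with waiting time, long-waiting jobs gradually rise in priority.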