In recent years, Deep Neural Networks (DNNs) have achieved remarkable breakthroughs in many application areas, but the time required to train increasingly complex DNN models on increasingly large datasets is also growing rapidly. Distributed training has become the mainstream approach for training DNN models because it reduces training time by pooling the computational resources of multiple nodes. Distributed deep learning is realized through the collaboration of a large-scale cluster of nodes, and improving its efficiency faces two primary challenges. First, the straggler problem reduces the utilization of heterogeneous GPU clusters; second, the extensive communication overhead makes it difficult to improve their effective utilization. To address these problems, this research analyzes the distributed training process and proposes a load balancing method and a communication optimization method that maximize the utilization of cluster computing resources while preserving model training accuracy, thereby significantly reducing training time. The main contributions of this research are as follows:

(1) A load balancing method, LBB (Load Balanced Batching), is proposed to address the straggler problem that arises when the mainstream Synchronous Stochastic Gradient Descent (SSGD) algorithm is used for distributed deep learning training on heterogeneous GPU clusters. LBB achieves load balancing by assigning and adjusting each node's SGD batch size (sketched below), thereby reducing the synchronization waiting overhead caused by stragglers. The method first analyzes the distributed deep learning process, establishes a performance model for each GPU, and formulates a load balancing problem model. A dynamic batch size coordination algorithm then solves this problem before training to achieve static load balancing, and adjusts the batch sizes of all nodes during training to achieve dynamic load balancing, alleviating or even eliminating the straggler problem.

(2) A communication optimization method, LBCAL (Load Balanced and Communication-aware Local SGD), is proposed on top of LBB. LBCAL overlaps the communication overhead of distributed deep learning training with computation. It first analyzes the communication overhead of distributed deep learning, then designs an adaptive communication interval adjustment strategy that balances computational efficiency and statistical efficiency based on the real-time learning rate and training loss. Finally, it mitigates the staleness of global model parameters caused by the overlap through a model parameter staleness compensation mechanism (both sketched below). LBCAL improves cluster computational efficiency while minimizing the degradation of statistical efficiency caused by overlapping computation and communication.

Extensive experiments show that LBB and LBCAL perform well on a server with heterogeneous GPUs. Four typical deep learning models are trained on the standard CIFAR10 and CIFAR100 datasets. The results demonstrate that, compared with SSGD, LBB reduces training time by 59.1% while maintaining nearly the same model accuracy. Furthermore, when communication is constrained, LBCAL further reduces training time by 45.7% compared with LBB, and the loss in model training accuracy is reduced from 3.98% to 1.28%.
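
The following is a minimal sketch of the batch-size assignment idea behind static load balancing in LBB: a fixed global batch is split across nodes in proportion to their measured speed so that all nodes finish an iteration at roughly the same time. The profiling-based `throughputs` list, the function name `assign_batch_sizes`, and the concrete numbers are illustrative assumptions, not the thesis's exact algorithm.

```python
# Sketch: proportional batch-size assignment for a heterogeneous GPU cluster.
# Assumption: per-node throughput (samples/sec) is measured in a short profiling run.

def assign_batch_sizes(throughputs, global_batch):
    """Split a fixed global batch across nodes in proportion to their measured speed."""
    total = sum(throughputs)
    # Initial proportional split, rounded down.
    sizes = [int(global_batch * t / total) for t in throughputs]
    # Hand the rounding remainder to the fastest nodes so the sum stays exact.
    remainder = global_batch - sum(sizes)
    for i in sorted(range(len(sizes)), key=lambda k: throughputs[k], reverse=True)[:remainder]:
        sizes[i] += 1
    return sizes

# Example: four GPUs, one roughly twice as fast as the others.
print(assign_batch_sizes([210.0, 105.0, 100.0, 95.0], global_batch=512))
```

Dynamic load balancing can reuse the same routine during training by re-measuring per-node iteration times and re-running the assignment whenever the measured speeds drift.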
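
The next sketch illustrates the kind of adaptive communication interval that LBCAL's strategy aims at: deciding, from the current learning rate and loss, how many local SGD steps to run before the next global synchronization. The specific rule shown (lengthen the interval as the learning rate decays, halve it when the loss stops improving) and all names such as `adjust_interval` and `base_interval` are hypothetical illustrations, not the published LBCAL rule.

```python
# Sketch: choosing a local-SGD synchronization interval from training signals.

def adjust_interval(base_interval, lr, base_lr, loss, prev_loss, max_interval=32):
    """Pick how many local steps to run before the next global synchronization."""
    # Local models drift more slowly once the learning rate has decayed,
    # so fewer synchronizations are tolerable (better computational efficiency).
    interval = int(base_interval * base_lr / max(lr, 1e-12))
    # If the loss has plateaued or increased, synchronize more often to protect
    # statistical efficiency.
    if prev_loss is not None and loss >= prev_loss:
        interval = max(1, interval // 2)
    return min(max(interval, 1), max_interval)

# Example: late in training (decayed learning rate) with a still-improving loss.
print(adjust_interval(base_interval=4, lr=0.01, base_lr=0.1, loss=0.52, prev_loss=0.55))
```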
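
Finally, a generic sketch of why staleness compensation is needed when communication overlaps computation: the global average a worker receives reflects parameters sent several local steps ago, so the worker folds back the local progress it made while that average was in flight. This merge rule and all names are assumptions for illustration only, not the compensation mechanism defined in the thesis.

```python
# Sketch: merging a delayed (stale) global average with local progress made
# during the communication overlap.

def compensate(stale_global, local_before_send, local_now):
    """Re-apply the local updates accumulated while the global average was in flight."""
    # (local_now - local_before_send) is the progress made during the overlap;
    # adding it to the stale average keeps those overlapped steps from being lost.
    return [g + (n - b) for g, b, n in zip(stale_global, local_before_send, local_now)]

# Example with two parameters.
print(compensate([0.50, -0.20], [0.48, -0.18], [0.45, -0.15]))
```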