With the rapid development of artificial intelligence, deep learning models and their training data are growing ever larger. As model parameter counts and training data volumes increase, model capability improves, but training time increases as well. This thesis therefore investigates data parallelism as a means of reducing distributed training time.

First, to address the gradient synchronization waiting caused by resource heterogeneity or resource sharing during cluster training, this thesis proposes a dynamic training-batch-size division algorithm (I-DAT) that keeps the cluster load balanced throughout training, thereby reducing the gradient synchronization waiting time among nodes. The experimental results show that the algorithm effectively reduces synchronization waiting time and smooths out fluctuations in machine performance. Applying I-DAT in the cluster environment of this thesis yields a speedup of about 1.01-1.04, and the larger the performance differences among machines in the cluster, the greater the speedup the algorithm delivers.

Second, this thesis analyzes the operation flow of the Ring-Allreduce communication architecture in detail and, building on it, proposes a fused communication-transmission algorithm (MRA) to reduce the time spent on gradient synchronization and parameter updates. The experimental results show that MRA effectively reduces the time consumed by the gradient synchronization and parameter update phases. They also show that the algorithm achieves better results on models with a high proportion of communication time, and that combining the I-DAT and MRA strategies yields a speedup of about 1.37-1.66.

Finally, to let users easily create a unified training environment in the cluster and spare experimenters a great deal of repetitive work, this thesis builds a container-based distributed training prototype system on Docker and Kubernetes and carries out test deployments. The system provides container creation, resource monitoring, task resource estimation, and related functions.

In summary, this thesis reduces total training time by optimizing the time consumption of the training process: the I-DAT algorithm reduces the gradient synchronization waiting time in the cluster, and the MRA algorithm reduces the gradient synchronization and parameter update time. In addition, to make these algorithms convenient to use for distributed training, the thesis builds a containerized distributed training prototype system that provides users with a ready-to-use training environment.
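The abstract does not spell out how I-DAT divides the global batch among nodes. One natural reading of the description, giving each node a share of the batch proportional to its recently measured throughput, with smoothing to absorb short-term performance fluctuation, is sketched below. The function, its parameters, and the smoothing scheme are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch of I-DAT-style dynamic batch division.
# Assumption: each node's share of the global batch is proportional to its
# recently measured throughput (samples/second), exponentially smoothed to
# damp short-term performance fluctuation. All names are illustrative.

def rebalance_batch_sizes(global_batch, throughputs, prev_shares=None, alpha=0.5):
    """Return per-node batch sizes summing to global_batch.

    throughputs: measured samples/sec per node over the last interval.
    prev_shares: previous fractional shares, used for smoothing.
    alpha:       smoothing factor (1.0 = react immediately).
    """
    total = sum(throughputs)
    shares = [t / total for t in throughputs]
    if prev_shares is not None:  # smooth against oscillation
        shares = [alpha * s + (1 - alpha) * p for s, p in zip(shares, prev_shares)]
        norm = sum(shares)
        shares = [s / norm for s in shares]
    sizes = [int(global_batch * s) for s in shares]
    sizes[0] += global_batch - sum(sizes)  # assign rounding remainder to node 0
    return sizes, shares

# Example: node 1 is twice as fast as node 0, so it receives about 2/3
# of the global batch and both nodes finish an iteration at similar times.
sizes, shares = rebalance_batch_sizes(256, [100.0, 200.0])
print(sizes)  # [86, 170]
```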
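For reference, the standard Ring-Allreduce flow that MRA builds on consists of a reduce-scatter phase followed by an allgather phase, each taking N-1 steps around the ring. The single-process simulation below illustrates only that standard flow; the MRA fusion of transmission and parameter update is not reproduced here.

```python
import numpy as np

def ring_allreduce(grads):
    """Single-process simulation of Ring-Allreduce over N nodes.

    grads: list of N equal-length numpy arrays (one gradient per node).
    Returns a list in which every node holds the elementwise sum.
    """
    n = len(grads)
    # Each node splits its gradient into n chunks, one per ring position.
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Phase 1: reduce-scatter. In each of n-1 steps, every node sends one
    # chunk to its right neighbor, which accumulates it. Afterwards node i
    # holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: allgather. In each of n-1 steps, every node forwards its most
    # recently completed chunk; the receiver overwrites its own copy.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

# Example: 3 nodes, node r contributes a vector filled with r + 1.
result = ring_allreduce([np.full(6, r + 1.0) for r in range(3)])
print(result[0])  # [6. 6. 6. 6. 6. 6.] -- every node ends with 1 + 2 + 3
```

Because every node sends and receives one chunk per step, the per-node traffic stays near 2(N-1)/N times the gradient size regardless of cluster size, which is why models with a high communication-time ratio benefit most from optimizing this path.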
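The prototype system is described only by its feature list. As a rough illustration of the container-creation function, the snippet below submits a training container to a cluster through the official Kubernetes Python client; the image, command, namespace, and resource values are placeholders, and the thesis's actual interface may differ.

```python
from kubernetes import client, config

def create_training_job(name, image, command, gpus=1, namespace="default"):
    """Submit a one-container training Job to the cluster.

    All arguments are illustrative placeholders; the thesis's prototype
    may expose a different interface on top of Docker and Kubernetes.
    """
    config.load_kube_config()  # use load_incluster_config() inside a pod
    container = client.V1Container(
        name=name,
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)}  # GPUs per worker
        ),
    )
    spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    template = client.V1PodTemplateSpec(spec=spec)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=0),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

# Example (hypothetical image and entry script):
# create_training_job("worker-0", "pytorch/pytorch:latest",
#                     ["python", "train.py"], gpus=1)
```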