With the rapid development of artificial intelligence, deep learning models and their training data are growing ever larger. As model parameter counts and training data volumes increase, model capability improves, but training time increases as well. This thesis therefore investigates data parallelism as a means of reducing distributed training time.

First, to address the gradient synchronization waiting caused by resource heterogeneity or resource sharing during cluster training, this thesis proposes a dynamic training-batch-size division algorithm (I-DAT) that keeps the cluster load balanced throughout training, thereby reducing the gradient synchronization waiting time among nodes. The experimental results show that the algorithm effectively reduces synchronization waiting time and smooths out fluctuations in machine performance. Applying I-DAT in the cluster environment of this thesis yields a speedup of about 1.01-1.04, and the larger the performance differences among machines in the cluster, the greater the speedup the algorithm delivers.

Second, this thesis analyzes the operation flow of the Ring-Allreduce communication architecture in detail and, building on it, proposes a fused communication-transmission algorithm (MRA) to reduce the time spent on gradient synchronization and parameter updates. The experimental results show that MRA effectively reduces the time consumed by the gradient synchronization and parameter update phases. They also show that the algorithm achieves better results on models with a high proportion of communication time, and that combining the I-DAT and MRA strategies yields a speedup of about 1.37-1.66.

Finally, to let users easily create a unified training environment in the cluster and spare experimenters a great deal of repetitive work, this thesis builds a container-based distributed training prototype system on Docker and Kubernetes and carries out test deployments. The system provides container creation, resource monitoring, task resource estimation, and related functions.

In summary, this thesis reduces total training time by optimizing the time consumption of the training process: the I-DAT algorithm reduces the gradient synchronization waiting time in the cluster, and the MRA algorithm reduces the gradient synchronization and parameter update time. In addition, to make these algorithms convenient to use for distributed training, the thesis builds a containerized distributed training prototype system that provides users with a ready-to-use training environment.
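The abstract does not spell out how I-DAT divides the global batch among nodes. One natural reading of the description, giving each node a share of the batch proportional to its recently measured throughput, with smoothing to absorb short-term performance fluctuation, is sketched below. The function, its parameters, and the smoothing scheme are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch of I-DAT-style dynamic batch division.
# Assumption: each node's share of the global batch is proportional to its
# recently measured throughput (samples/second), exponentially smoothed to
# damp short-term performance fluctuation. All names are illustrative.

def rebalance_batch_sizes(global_batch, throughputs, prev_shares=None, alpha=0.5):
    """Return per-node batch sizes summing to global_batch.

    throughputs: measured samples/sec per node over the last interval.
    prev_shares: previous fractional shares, used for smoothing.
    alpha:       smoothing factor (1.0 = react immediately).
    """
    total = sum(throughputs)
    shares = [t / total for t in throughputs]
    if prev_shares is not None:  # smooth against oscillation
        shares = [alpha * s + (1 - alpha) * p for s, p in zip(shares, prev_shares)]
        norm = sum(shares)
        shares = [s / norm for s in shares]
    sizes = [int(global_batch * s) for s in shares]
    sizes[0] += global_batch - sum(sizes)  # assign rounding remainder to node 0
    return sizes, shares

# Example: node 1 is twice as fast as node 0, so it receives about 2/3
# of the global batch and both nodes finish an iteration at similar times.
sizes, shares = rebalance_batch_sizes(256, [100.0, 200.0])
print(sizes)  # [86, 170]
```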
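For reference, the standard Ring-Allreduce flow that MRA builds on consists of a reduce-scatter phase followed by an allgather phase, each taking N-1 steps around the ring. The single-process simulation below illustrates only that standard flow; the MRA fusion of transmission and parameter update is not reproduced here.

```python
import numpy as np

def ring_allreduce(grads):
    """Single-process simulation of Ring-Allreduce over N nodes.

    grads: list of N equal-length numpy arrays (one gradient per node).
    Returns a list in which every node holds the elementwise sum.
    """
    n = len(grads)
    # Each node splits its gradient into n chunks, one per ring position.
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Phase 1: reduce-scatter. In each of n-1 steps, every node sends one
    # chunk to its right neighbor, which accumulates it. Afterwards node i
    # holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: allgather. In each of n-1 steps, every node forwards its most
    # recently completed chunk; the receiver overwrites its own copy.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

# Example: 3 nodes, node r contributes a vector filled with r + 1.
result = ring_allreduce([np.full(6, r + 1.0) for r in range(3)])
print(result[0])  # [6. 6. 6. 6. 6. 6.] -- every node ends with 1 + 2 + 3
```

Because every node sends and receives one chunk per step, the per-node traffic stays near 2(N-1)/N times the gradient size regardless of cluster size, which is why models with a high communication-time ratio benefit most from optimizing this path.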
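The prototype system is described only by its feature list. As a rough illustration of the container-creation function, the snippet below submits a training container to a cluster through the official Kubernetes Python client; the image, command, namespace, and resource values are placeholders, and the thesis's actual interface may differ.

```python
from kubernetes import client, config

def create_training_job(name, image, command, gpus=1, namespace="default"):
    """Submit a one-container training Job to the cluster.

    All arguments are illustrative placeholders; the thesis's prototype
    may expose a different interface on top of Docker and Kubernetes.
    """
    config.load_kube_config()  # use load_incluster_config() inside a pod
    container = client.V1Container(
        name=name,
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": str(gpus)}  # GPUs per worker
        ),
    )
    spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    template = client.V1PodTemplateSpec(spec=spec)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=0),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

# Example (hypothetical image and entry script):
# create_training_job("worker-0", "pytorch/pytorch:latest",
#                     ["python", "train.py"], gpus=1)
```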