
Parallel and Distributed Training of Deep Learning

Posted on: 2021-07-23 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Q J Yao | Full Text: PDF
GTID: 1488306518484154 | Subject: Computer software and theory
Abstract/Summary:
In recent years, the scale of deep learning models and training data sets has kept increasing in order to improve performance. However, training models on large-scale data suffers from slow convergence and time-consuming computation, which challenges both optimization algorithms and hardware computing capabilities. From the perspective of computation, parallel and distributed training uses the computing power of multi-core or many-core processors, clusters, or cloud resources to reduce the computation time of the optimization algorithm, thereby accelerating model training. Many deep learning models have been proposed for different scenarios; they have different network structures and computation characteristics, so efficient parallel training must be guided by both the model's features and the hardware's computing structure. For parallel training on a single computing card, the mainstream parallel algorithms based on matrix operations must rewrite the computation of neurons in matrix form, and their parallel performance depends on the matrix library. For distributed training on a single machine with multiple graphics processing units (GPUs), although synchronous data parallelism has better convergence, the speedup of distributed training is strongly affected by the computation characteristics of the model and needs to be optimized. To address these issues, the following three aspects are studied.

From the perspective of hardware structure, the parallel training capability of Many Integrated Core (MIC) coprocessors for deep learning models is explored. Since the MIC card offers both thread-level and single-instruction multiple-data (SIMD) parallelism, a two-level parallel algorithm based on mini-batches and neurons is proposed to train an unsupervised deep autoencoder model. To optimize the performance of the parallel algorithm, data transmission between the CPU and the MIC is hidden by overlapping computation with transmission, a gradient recomputation strategy is used to reduce memory overhead, and atomic operations are used to reduce thread-synchronization overhead. Experimental results show that, compared with the parallel method based on matrix computation, the speedup of the proposed algorithm is stable, reaching 4.5 times over CPU training.

From the perspective of the network communication overhead of distributed training, observation of the distributed training speedup of deep learning models on a single machine with multiple GPUs shows that the gradient exchange of fully connected neural networks generates heavy network overhead, which severely degrades distributed training performance. Based on this observation, a distributed training framework based on model averaging is proposed, in which the sub-models are synchronized at a lower frequency to reduce network overhead. The framework implements four model-averaging optimization algorithms for multi-GPU distributed training. Multi-GPU parallelism is implemented with multiple streams: separate data and computation streams are designed to overlap data transmission with computation. Based on the multi-GPU topology, a tree-reduce parameter-exchange algorithm is designed. Experimental results show that, compared with single-GPU training, the model-averaging framework achieves a 1.6 times speedup on two GPUs; compared with bulk synchronous parallel training, it achieves a 17 times speedup.
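The low-frequency synchronization idea behind model averaging can be illustrated with a minimal sketch. The code below is not the dissertation's implementation: it assumes PyTorch, one model replica per visible GPU, and hypothetical names such as `tree_average_`, `sync_period`, and `train_with_model_averaging`; the stream-based overlap of transmission and computation mentioned above is omitted for brevity. Each replica takes local SGD steps on its own data shard, and only every `sync_period` steps are the replicas' parameters averaged (here with a simple pairwise tree reduction) and copied back, in place of per-step gradient exchange.

```python
# Hedged sketch (not the dissertation's code): periodic model averaging across
# GPU replicas with a pairwise tree-reduce. Assumes PyTorch, one replica per
# visible GPU; names such as sync_period and tree_average_ are illustrative.
import torch

def tree_average_(replicas):
    """Average the parameters of all replicas with a binary tree reduction,
    then copy the result back to every replica (in place)."""
    group = list(replicas)
    stride = 1
    while stride < len(group):
        for i in range(0, len(group) - stride, 2 * stride):
            dst, src = group[i], group[i + stride]
            for p_dst, p_src in zip(dst.parameters(), src.parameters()):
                # Accumulate on the destination GPU; transfer crosses devices.
                p_dst.data.add_(p_src.data.to(p_dst.device))
        stride *= 2
    n = float(len(group))
    for p in group[0].parameters():
        p.data.div_(n)
    # Broadcast the averaged parameters back to the other replicas.
    for other in group[1:]:
        for p_dst, p_src in zip(other.parameters(), group[0].parameters()):
            p_dst.data.copy_(p_src.data.to(p_dst.device))

def train_with_model_averaging(make_model, loaders, sync_period=100, steps=1000):
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    replicas = [make_model().to(d) for d in devices]
    optimizers = [torch.optim.SGD(m.parameters(), lr=0.01) for m in replicas]
    iters = [iter(loader) for loader in loaders]     # one data shard per GPU
    for step in range(steps):
        for m, opt, it, d in zip(replicas, optimizers, iters, devices):
            x, y = next(it)
            loss = torch.nn.functional.cross_entropy(m(x.to(d)), y.to(d))
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (step + 1) % sync_period == 0:            # low-frequency synchronization
            tree_average_(replicas)
    return replicas[0]
```

In a full framework the averaging step would additionally run on dedicated data streams so that parameter transfers overlap with the next mini-batch's computation; the sketch keeps everything sequential for clarity.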
From the perspective of the distributed training algorithm, a parallel algorithm for compound deep learning models is studied. Recommendation systems are a typical application scenario for compound deep learning models. Taking product recommendation in mobile games as an example, a compound recommendation framework based on long short-term memory (LSTM), an autoencoder, and an attention network is constructed to capture user interest and provide interpretable recommendation results. The LSTM models the user's immediate interests, the autoencoder compresses sparse features, and the attention network provides guidance for interpretability. To accelerate model training, and considering the computing characteristics of each sub-model, a model-parallel algorithm based on layer division is proposed (a minimal sketch of this layer-division pattern follows the summary below). Compared with single-GPU training, it achieves a 2.6 times speedup on four GPUs.

In summary, focusing on parallel and distributed training mechanisms for deep learning, this dissertation studies parallel training of deep autoencoders, distributed training of fully connected deep learning models, and distributed training of compound deep learning models. Parallel algorithms and frameworks are designed to make full use of the computing power of the hardware and to accelerate the training of deep learning models.
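As a companion to the layer-division model parallelism referenced above, here is a minimal, hypothetical sketch in PyTorch. It is not the dissertation's recommendation model: the sub-module sizes, the two-GPU partition, and names such as `LayerSplitRecommender` are illustrative assumptions. The autoencoder-style encoder that compresses sparse features is placed on one GPU, while the LSTM, attention scoring, and output head are placed on another; activations cross the device boundary once per forward pass, and autograd propagates gradients back across it.

```python
# Hedged sketch (not the dissertation's code): model parallelism by layer
# division across two GPUs, assuming PyTorch. Sub-module sizes and the
# partition are illustrative; the dissertation's actual model is not shown.
import torch
import torch.nn as nn

class LayerSplitRecommender(nn.Module):
    def __init__(self, sparse_dim=10000, dense_dim=128, hidden_dim=256):
        super().__init__()
        # GPU 0: autoencoder-style encoder compressing sparse features.
        self.encoder = nn.Sequential(
            nn.Linear(sparse_dim, dense_dim), nn.ReLU()
        ).to("cuda:0")
        # GPU 1: LSTM over the user's recent behaviour sequence, plus an
        # attention scoring layer and the output head.
        self.lstm = nn.LSTM(dense_dim, hidden_dim, batch_first=True).to("cuda:1")
        self.attn = nn.Linear(hidden_dim, 1).to("cuda:1")
        self.head = nn.Linear(hidden_dim, 1).to("cuda:1")

    def forward(self, sparse_seq):
        # sparse_seq: (batch, seq_len, sparse_dim), expected on cuda:0.
        b, t, _ = sparse_seq.shape
        dense = self.encoder(sparse_seq.reshape(b * t, -1)).reshape(b, t, -1)
        dense = dense.to("cuda:1")              # activation crosses GPUs here
        out, _ = self.lstm(dense)               # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.attn(out), dim=1)
        pooled = (weights * out).sum(dim=1)     # attention-weighted pooling
        return torch.sigmoid(self.head(pooled))

# Usage note: the loss is computed on cuda:1, and loss.backward() propagates
# gradients back across the device boundary automatically.
```

The actual layer split in the dissertation is chosen according to the computing characteristics of each sub-model; the two-way partition shown here is only one possible division.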
Keywords/Search Tags: Deep learning, Parallel and distributed computation, Data parallelism, Network overhead, Model averaging, Compound model, Model parallelism