
Research And Implementation Of Pipeline-based Distributed Deep Learning Training Optimization Technology In GPU Cluster Environment

Posted on: 2021-11-10    Degree: Master    Type: Thesis
Country: China    Candidate: J Zhan    Full Text: PDF
GTID: 2518306476953119    Subject: Computer application technology
Abstract/Summary:
The training of deep neural networks (DNNs) is usually compute- and memory-intensive, often relying on the large amount of GPU compute and memory resources in a GPU cluster and on distributed training to improve training efficiency. In traditional distributed training, data parallelism incurs huge communication overhead due to parameter synchronization, while model parallelism suffers from low GPU utilization due to computational dependencies; both reduce the efficiency of distributed training. To address this, the recent pipelined distributed training approach significantly increases GPU utilization by injecting training data into a model-parallel pipeline in a time-shared manner. However, when performing pipelined distributed training on existing GPU clusters, the heterogeneous network bandwidth between GPUs makes model partitioning and task placement difficult, and insufficient GPU memory severely limits the training of larger deep neural networks in pipeline mode.

In response to these problems, this thesis focuses on training optimization for pipeline-based distributed deep learning in a GPU cluster environment, and studies how to further improve training efficiency and enlarge the size of the supported DNN models, so as to train larger DNN models faster in a GPU cluster. The specific contents are as follows.

First, since the heterogeneous network among GPUs in a GPU cluster makes load-balanced model partitioning and task placement difficult in pipeline mode, this thesis studies a heterogeneous-network-aware model partitioning and task placement mechanism. Based on the characteristics of network-heterogeneous GPU cluster environments and the pipeline-parallel mode, a load-balanced model partitioning and task placement mechanism is proposed to reduce GPU idle time, improve the overall throughput of pipeline training, and thereby minimize model training time (a simplified partitioning sketch is given below).

Second, regarding the insufficient GPU memory caused by caching multiple versions of intermediate results and model parameters in pipeline mode, an overhead-balance-driven memory recomputation optimization mechanism is studied. Based on an analysis of memory occupancy in pipeline mode, and in order to support pipelined training of larger neural network models, a GPU memory recomputation method is designed that balances recomputation cost across the pipeline stages while maximizing pipeline throughput (a recomputation sketch is also given below).

Finally, this thesis designs and implements a pipeline-based distributed deep learning training optimization system in a GPU cluster environment. Based on the real GPU cluster of the cloud computing platform of Southeast University, the theoretical results are combined with practice: a prototype system is designed, implemented, deployed, and evaluated. The experimental results show that the proposed pipeline-based distributed deep learning training optimization mechanisms not only improve distributed training efficiency, but also enable very large neural networks to be trained in a pipelined distributed manner, achieving distributed training with a higher speedup ratio and lower GPU memory usage.
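The following is a minimal, illustrative sketch of the kind of heterogeneous-bandwidth-aware stage partitioning the first mechanism targets; it is not the thesis's actual algorithm. It splits a layer chain into contiguous pipeline stages by dynamic programming, minimizing the slowest stage, where a stage's cost is its compute time plus the time to send its boundary activation over the (possibly slower) link to the next GPU. All function names, costs, and bandwidths below are assumptions for illustration.

```python
# Sketch only: balanced pipeline-stage partitioning under heterogeneous links.
def partition_layers(compute_ms, act_mb, link_mb_per_ms, num_stages):
    """compute_ms[i]     -- assumed profiled per-layer time (ms)
       act_mb[i]         -- size of layer i's output activation (MB)
       link_mb_per_ms[s] -- bandwidth of the link after stage s (MB/ms)"""
    n = len(compute_ms)
    INF = float("inf")
    # dp[s][i] = best bottleneck when the first i layers form s stages
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[-1] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    prefix = [0.0]
    for c in compute_ms:
        prefix.append(prefix[-1] + c)

    for s in range(1, num_stages + 1):
        for i in range(1, n + 1):
            for j in range(s - 1, i):              # stage s holds layers j..i-1
                stage_compute = prefix[i] - prefix[j]
                comm = 0.0
                if s < num_stages:                  # send boundary activation onward
                    comm = act_mb[i - 1] / link_mb_per_ms[s - 1]
                bottleneck = max(dp[s - 1][j], stage_compute + comm)
                if bottleneck < dp[s][i]:
                    dp[s][i] = bottleneck
                    cut[s][i] = j

    # Recover the chosen stage boundaries
    bounds, i = [], n
    for s in range(num_stages, 0, -1):
        j = cut[s][i]
        bounds.append((j, i))
        i = j
    return dp[num_stages][n], list(reversed(bounds))


if __name__ == "__main__":
    # Toy example: 6 layers, 3 stages, second inter-stage link 4x slower
    bottleneck, stages = partition_layers(
        compute_ms=[4, 6, 8, 8, 6, 4],
        act_mb=[16, 16, 32, 32, 16, 16],
        link_mb_per_ms=[8.0, 2.0],
        num_stages=3,
    )
    print("bottleneck per micro-batch (ms):", bottleneck)
    print("stage layer ranges:", stages)
```

With the slower second link, the partitioner shifts the cut so that the stage feeding that link emits a smaller activation, illustrating how heterogeneous bandwidth changes the load-balanced placement.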
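The second mechanism relies on recomputing intermediate activations instead of caching them. A minimal sketch of that idea, using PyTorch's built-in checkpointing rather than the thesis's own cost-balanced scheme, is shown below; the module sizes and segment count are illustrative assumptions.

```python
# Sketch only: activation recomputation inside one pipeline stage.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# One pipeline stage (run on CPU here for simplicity; a real stage sits on one GPU).
stage = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
)

x = torch.randn(32, 1024, requires_grad=True)

# Split the stage into 2 checkpointed segments: only segment-boundary
# activations are kept in the forward pass; the rest are recomputed during
# backward, trading extra compute for lower peak GPU memory.
y = checkpoint_sequential(stage, 2, x)
y.sum().backward()
```

The thesis's mechanism additionally balances how much recomputation each pipeline stage performs, so that no single stage's extra compute becomes the new pipeline bottleneck.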
Keywords/Search Tags:Distributed deep learning, Pipeline-hybrid parallelism, Heterogeneous network environment, GPU memory optimization