
Research On Deep Neural Network Training Acceleration Strategies With Data Parallelization

Posted on: 2021-05-19  Degree: Master  Type: Thesis
Country: China  Candidate: J X Ye  Full Text: PDF
GTID: 2428330623467795  Subject: Computer Science and Technology
Abstract/Summary:
In recent years, deep neural networks have shown remarkable modeling capability in research fields such as image and speech processing, and have therefore become very popular in both academia and industry. For both communities, the ability to train models quickly is essential for rapidly analyzing experimental results and adjusting algorithms. This thesis therefore first defines the strategy and workflow for data-parallel training of deep neural networks in a multi-server, multi-GPU setting. Based on the hook mechanism, it then implements a simple, easy-to-use data-parallel distributed training extension interface under the PyTorch framework. Analysis shows that in data-parallel training the fragmentation of gradient data introduces redundant overhead in all-reduce communication, so an asynchronous communication strategy is proposed that aggregates the fragmented gradient tensors and performs a single all-reduce synchronization, effectively reducing the communication overhead. The PyTorch extension developed in this work also supports mixed-precision training and achieves up to 1.71x training speedup on GPUs equipped with Tensor Cores. To address numerical overflow of gradients in mixed-precision training, an adaptive overflow-aware loss scaling strategy is proposed, which effectively alleviates the non-convergence caused by gradient overflow. This thesis further demonstrates that multi-machine, multi-GPU data-parallel training should use local batch normalization, whose speedup is especially pronounced for networks with many batch normalization layers. Finally, a maximum efficiency of 99.4% is achieved on 32 GPUs, and training MobileNet-v1 can be completed in 1 hour and 37 minutes. Moreover, with the distributed training extension strategy proposed in this thesis, training with large batches causes no loss of accuracy; the trained models even exceed the official baselines. For example, ResNet-50 trained in this work reaches 78.06%, higher than the official 76.86%, and MobileNet-v1 reaches 73.48%, higher than the official 70.9%.
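
The gradient-aggregation idea described above can be illustrated with a short PyTorch sketch. The snippet below is a simplified illustration rather than the thesis's actual extension (which hooks into the backward pass via PyTorch's hook mechanism): it assumes `torch.distributed` has already been initialised (e.g. with the NCCL backend) and simply fuses all gradients into one buffer so that a single asynchronous all-reduce replaces many small, latency-bound ones.

```python
import torch
import torch.distributed as dist

def fused_allreduce_grads(model):
    """Flatten all gradients into one buffer, all-reduce it once asynchronously,
    then copy the averaged slices back into each parameter's .grad.
    Assumes torch.distributed has been initialised (e.g. NCCL backend)."""
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad for p in params]

    # Fuse many small gradient tensors into a single contiguous buffer so that
    # one all-reduce replaces many fragmented ones.
    flat = torch.cat([g.reshape(-1) for g in grads])

    # Launch the all-reduce asynchronously; other work could be overlapped here
    # before waiting on the handle.
    handle = dist.all_reduce(flat, op=dist.ReduceOp.SUM, async_op=True)
    handle.wait()
    flat.div_(dist.get_world_size())  # average across workers

    # Scatter the synchronised gradients back to the parameters.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n

# Typical use, after the backward pass and before the optimiser step:
#   loss.backward()
#   fused_allreduce_grads(model)
#   optimizer.step()
```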
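
The overflow-aware loss scaling can likewise be sketched with a generic dynamic loss scaler. The class below is a minimal illustration with placeholder hyper-parameters (initial scale, growth and backoff factors), not the thesis's actual adaptive strategy: the loss is multiplied by the scale before backward, and if any FP16 gradient overflows to inf/NaN the update is skipped and the scale is reduced; after a run of overflow-free steps the scale grows again.

```python
import torch

class DynamicLossScaler:
    """Minimal sketch of overflow-aware (dynamic) loss scaling for mixed-precision
    training; the default values are illustrative only."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def scale_loss(self, loss):
        # Call loss.backward() on the returned tensor.
        return loss * self.scale

    def unscale_and_step(self, optimizer, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        overflow = any(not torch.isfinite(g).all() for g in grads)
        if overflow:
            # Gradients overflowed in FP16: skip this update and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            optimizer.zero_grad()
            return False
        for g in grads:
            g.div_(self.scale)  # undo the loss scaling before the update
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= self.growth_factor  # no overflow for a while: grow the scale
        return True
```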
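
On the batch-normalization point, "local" batch normalization simply means leaving PyTorch's standard BatchNorm layers untouched, so each GPU normalises over its own per-device mini-batch. The snippet below (the torchvision model is only an illustrative stand-in) contrasts this with the synchronized alternative, which adds a cross-GPU reduction of statistics for every BN layer.

```python
import torch
import torchvision

# Ordinary nn.BatchNorm2d layers compute statistics over each GPU's local
# mini-batch, so keeping them as-is gives local batch normalization with no
# extra per-layer cross-GPU communication.
model = torchvision.models.resnet50()

# The synchronized alternative (costlier, per the abstract's finding):
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```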
Keywords/Search Tags: Data Parallel, High Performance Computing, Deep Neural Network, Distributed Computing