
Improvement Of Gradient Sparsification In Distributed Deep Learning

Posted on: 2022-03-18 | Degree: Master | Type: Thesis
Country: China | Candidate: S Q Li | Full Text: PDF
GTID: 2518306332967919 | Subject: Computer Science and Technology
Abstract/Summary:
Deep learning neural networks have evolved into powerful tools for artificial intelligence tasks such as image classification, video tracking, and natural language processing. To reduce the training cost of large deep neural networks, distributed deep learning is a widely used solution. It overcomes the storage and computing limitations of a single machine and makes it possible for the training of large deep neural networks to achieve speedups that scale nearly linearly with the number of machines. However, in distributed deep learning, the communication overhead between machines becomes the bottleneck of training acceleration, so gradient compression is needed to reduce the overhead of transmitting parameters. Among gradient compression methods, gradient sparsification is popular. The traditional gradient sparsification method filters the transmitted gradients according to a selection strategy: only the important gradients are transmitted and applied, while the unimportant gradients are accumulated locally and wait for later iterations. However, when a high sparsity is adopted, a large proportion of the gradients is delayed. During optimization, adaptive optimizers cannot distinguish the historical gradients from the latest gradients within the currently transmitted gradient, which causes the model's convergence to deviate and its accuracy to decline.

In this paper, a General Gradient Sparsification framework (GGS) is proposed to overcome this drawback when gradients are deeply compressed. GGS consists of two mechanisms: gradient correction and batch normalization update with local gradients (BN-LG). The gradient correction method adjusts the order of the gradient update steps, cancels the global optimizer, and places the optimizer locally, so that adaptive optimizers can correctly distinguish the historical gradients from the latest gradients. Delayed update parameters are then handled correctly, and the deviation of model convergence is resolved. We provide a mathematical proof and demonstrate the generality and convergence of the gradient correction method under various adaptive optimizers. In addition, the BN-LG method adopts a hybrid update mode that cancels the synchronous update of the batch normalization layers. By using BN-LG, GGS can reduce the influence of delayed gradients without increasing communication overhead.

We have conducted experiments on LeNet-5, CifarNet, DenseNet-121, and AlexNet with adaptive optimizers. Results show that even when 99.9% of the gradients are sparsified, top-1 accuracy on the validation datasets is maintained. For convenience, we integrated GGS into PyTorch and packaged it into the open-source distributed training platform OpenPAI.
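As a rough illustration of the ideas summarized above, the following single-process PyTorch sketch shows top-k gradient sparsification with local residual accumulation, and applies a local adaptive optimizer only after the sparse gradients have been formed, mirroring the gradient-correction ordering described in the abstract. The function and variable names (e.g. sparsify_with_residual) are illustrative assumptions, not the GGS or OpenPAI API, and the communication step is omitted.

```python
# Minimal sketch, assuming single-process PyTorch; not the GGS implementation.
import torch

def sparsify_with_residual(grad, residual, sparsity=0.999):
    """Keep only the largest-magnitude entries of (grad + residual);
    accumulate the dropped entries locally for later iterations."""
    acc = grad + residual                       # include locally accumulated gradients
    k = max(1, int(acc.numel() * (1.0 - sparsity)))
    flat = acc.flatten()
    _, idx = torch.topk(flat.abs(), k)          # indices of the "important" gradients
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    sparse = torch.where(mask, flat, torch.zeros_like(flat)).view_as(acc)
    new_residual = torch.where(~mask, flat, torch.zeros_like(flat)).view_as(acc)
    return sparse, new_residual                 # the sparse part is what gets communicated

# Hypothetical training step with a *local* adaptive optimizer.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())        # kept local, not global
residuals = [torch.zeros_like(p) for p in model.parameters()]

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

for i, p in enumerate(model.parameters()):
    sparse_grad, residuals[i] = sparsify_with_residual(p.grad, residuals[i])
    # In a real distributed run, sparse_grad would be exchanged across workers
    # here (and, per BN-LG, batch-normalization gradients could be kept local).
    p.grad = sparse_grad

optimizer.step()       # adaptive update is applied only to the freshly exchanged gradients
optimizer.zero_grad()
```

With 99.9% sparsity as in the abstract, only about 0.1% of the entries of each gradient tensor survive the top-k filter per iteration; the rest are carried forward in the residual buffers.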
Keywords/Search Tags: Deep Learning, Distributed Learning, Gradient Sparsification, Adaptive Optimizer