
Improvement Of Gradient Sparsification In Distributed Deep Learning

Posted on: 2022-03-18 | Degree: Master | Type: Thesis
Country: China | Candidate: S Q Li | Full Text: PDF
GTID: 2518306332967919 | Subject: Computer Science and Technology
Abstract/Summary:
Deep learning neural networks have evolved into powerful tools for artificial intelligence tasks such as image classification, video tracking, and natural language processing. To reduce the training cost of large deep neural networks, distributed deep learning is a widely used solution. It overcomes the storage and computing limitations of a single machine and makes it possible for the training of large deep neural networks to achieve speedups that scale nearly linearly with the number of machines. However, in distributed deep learning, the communication overhead between machines becomes the bottleneck of training acceleration, so gradient compression is needed to reduce the overhead of transmitting parameters. Among gradient compression methods, gradient sparsification is popular. The traditional gradient sparsification method filters the transmitted gradients according to a selection strategy: only the important gradients are transmitted and applied, while the unimportant gradients are accumulated locally and wait for later iterations. However, when a high sparsity is adopted, a large proportion of the gradients is delayed. During optimization, adaptive optimizers cannot distinguish the historical gradients from the latest gradients within the currently transmitted gradient, which causes the model's convergence to deviate and its accuracy to decline.

In this paper, a General Gradient Sparsification framework (GGS) is proposed to overcome this drawback when gradients are deeply compressed. GGS consists of two mechanisms: gradient correction and batch normalization update with local gradients (BN-LG). The gradient correction method adjusts the order of the gradient update steps, cancels the global optimizer, and places the optimizer locally, so that adaptive optimizers can correctly distinguish the historical gradients from the latest gradients. Delayed update parameters are then handled correctly, and the deviation of model convergence is resolved. We provide a mathematical proof and demonstrate the generality and convergence of the gradient correction method under various adaptive optimizers. In addition, the BN-LG method adopts a hybrid update mode that cancels the synchronous update of the batch normalization layers. By using BN-LG, GGS can reduce the influence of delayed gradients without increasing communication overhead.

We have conducted experiments on LeNet-5, CifarNet, DenseNet-121, and AlexNet with adaptive optimizers. Results show that even when 99.9% of the gradients are sparsified, top-1 accuracy on the validation datasets is maintained. For convenience, we integrated GGS into PyTorch and packaged it into the open-source distributed training platform OpenPAI.
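As a rough illustration of the ideas summarized above, the following single-process PyTorch sketch shows top-k gradient sparsification with local residual accumulation, and applies a local adaptive optimizer only after the sparse gradients have been formed, mirroring the gradient-correction ordering described in the abstract. The function and variable names (e.g. sparsify_with_residual) are illustrative assumptions, not the GGS or OpenPAI API, and the communication step is omitted.

```python
# Minimal sketch, assuming single-process PyTorch; not the GGS implementation.
import torch

def sparsify_with_residual(grad, residual, sparsity=0.999):
    """Keep only the largest-magnitude entries of (grad + residual);
    accumulate the dropped entries locally for later iterations."""
    acc = grad + residual                       # include locally accumulated gradients
    k = max(1, int(acc.numel() * (1.0 - sparsity)))
    flat = acc.flatten()
    _, idx = torch.topk(flat.abs(), k)          # indices of the "important" gradients
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    sparse = torch.where(mask, flat, torch.zeros_like(flat)).view_as(acc)
    new_residual = torch.where(~mask, flat, torch.zeros_like(flat)).view_as(acc)
    return sparse, new_residual                 # the sparse part is what gets communicated

# Hypothetical training step with a *local* adaptive optimizer.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())        # kept local, not global
residuals = [torch.zeros_like(p) for p in model.parameters()]

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()

for i, p in enumerate(model.parameters()):
    sparse_grad, residuals[i] = sparsify_with_residual(p.grad, residuals[i])
    # In a real distributed run, sparse_grad would be exchanged across workers
    # here (and, per BN-LG, batch-normalization gradients could be kept local).
    p.grad = sparse_grad

optimizer.step()       # adaptive update is applied only to the freshly exchanged gradients
optimizer.zero_grad()
```

With 99.9% sparsity as in the abstract, only about 0.1% of the entries of each gradient tensor survive the top-k filter per iteration; the rest are carried forward in the residual buffers.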
Keywords/Search Tags: Deep Learning, Distributed Learning, Gradient Sparsification, Adaptive Optimizer