
Communication Efficient Distributed Parallel Stochastic Optimization Algorithms

Posted on: 2021-03-15    Degree: Master    Type: Thesis
Country: China    Candidate: S H Shen    Full Text: PDF
GTID: 2428330602999093    Subject: Computer application technology
Abstract/Summary:
With the increase in the amount of data and the expansion of model scale, stochastic gradient descent (SGD) and its distributed parallel variants have become the workhorses of large-scale machine learning. Although distributed SGD algorithms can achieve a linear iteration speedup, in practice they are limited significantly by communication cost, which makes a linear time speedup difficult to achieve. Designing communication-efficient distributed parallel algorithms is therefore an important problem in machine learning research. In this paper, we propose algorithms that reduce the communication cost from two different points of view.

For training large-scale deep learning models, we propose the Computation and Communication Decoupled Stochastic Gradient Descent (CoCoD-SGD) algorithm, which runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD achieves a linear iteration speedup with respect to the total computational capability of the hardware in both homogeneous and heterogeneous environments. In addition, it has lower communication complexity and better time speedup than traditional distributed SGD algorithms. Specifically, when N devices collaboratively run T iterations, the communication complexity of CoCoD-SGD is O(N^{3/4} T^{3/4}), matching the state-of-the-art Local SGD, while CoCoD-SGD achieves a much better time speedup because its computation and communication overlap. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD.

From the perspective of communication complexity, previous studies have shown that the communication complexity of Local SGD with a fixed or an adaptive communication period is O(N^{3/2} T^{1/2}) when the data distributions on the clients are identical (IID) and O(N^{3/4} T^{3/4}) otherwise (Non-IID). In this paper, we propose STagewise Local SGD (STL-SGD), which gradually increases the communication period as the learning rate decreases. We prove that STL-SGD retains the same convergence rate and linear speedup as mini-batch SGD. Moreover, owing to the increasing communication period, when the objective is convex or satisfies the Polyak-Łojasiewicz condition, the communication complexity of STL-SGD is O(N log T) in the IID case and O(N^{1/2} T^{1/2}) in the Non-IID case, a significant improvement over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
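To make the decoupling idea behind CoCoD-SGD concrete, the following is a minimal single-process sketch in the spirit of the algorithm described above: while an averaging step over the current models is "in flight" (simulated here by a sleeping worker thread), each worker keeps taking local SGD steps, and the progress made during communication is added back onto the averaged model. The toy least-squares data, the helper names (grad, simulated_all_reduce), and all hyperparameters are illustrative assumptions, not the thesis's implementation.

    # Toy simulation of computation/communication decoupling (CoCoD-SGD spirit).
    import time
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    rng = np.random.default_rng(0)
    N, dim, rounds, local_steps, lr = 4, 10, 50, 5, 0.05
    A = [rng.standard_normal((32, dim)) for _ in range(N)]            # per-worker data
    b = [a @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(32) for a in A]

    def grad(k, x):
        # stochastic gradient of a least-squares objective on worker k
        i = rng.integers(0, 32, size=8)
        return A[k][i].T @ (A[k][i] @ x - b[k][i]) / len(i)

    def simulated_all_reduce(models):
        time.sleep(0.01)                    # stands in for network latency
        return np.mean(models, axis=0)      # average of the models that were sent

    x = [np.zeros(dim) for _ in range(N)]   # per-worker model copies
    with ThreadPoolExecutor(max_workers=1) as comm:
        for _ in range(rounds):
            # launch averaging of the current models without blocking ...
            future = comm.submit(simulated_all_reduce, [xi.copy() for xi in x])
            # ... while each worker keeps computing local SGD updates
            deltas = []
            for k in range(N):
                d = np.zeros(dim)
                for _ in range(local_steps):
                    d -= lr * grad(k, x[k] + d)
                deltas.append(d)
            avg = future.result()           # wait for the averaged model
            # combine: averaged model plus the progress made during communication
            x = [avg + deltas[k] for k in range(N)]

    xbar = np.mean(x, axis=0)
    print("final loss:", np.mean([np.mean((A[k] @ xbar - b[k]) ** 2) for k in range(N)]))

In a real distributed deployment, the sleeping thread would be replaced by a non-blocking collective (for example, an asynchronous all-reduce), and each worker would run its local steps on its own machine rather than in the serial loop used in this simulation.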
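Similarly, the stagewise schedule of STL-SGD can be illustrated with a short sketch: workers run local SGD between synchronizations, and at the end of each stage the learning rate is decreased while the communication period is increased. The halving/doubling factors, stage lengths, and toy objective below are my own illustrative assumptions, not the schedule proved in the thesis.

    # Toy simulation of a stagewise local-SGD schedule (STL-SGD spirit).
    import numpy as np

    rng = np.random.default_rng(1)
    N, dim = 4, 10
    A = [rng.standard_normal((32, dim)) for _ in range(N)]
    b = [a @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(32) for a in A]

    def grad(k, x):
        i = rng.integers(0, 32, size=8)
        return A[k][i].T @ (A[k][i] @ x - b[k][i]) / len(i)

    x = [np.zeros(dim) for _ in range(N)]
    lr, period, stages, iters_per_stage = 0.1, 1, 6, 200

    for s in range(stages):
        t = 0
        while t < iters_per_stage:
            for _ in range(period):          # local SGD steps between syncs
                for k in range(N):
                    x[k] = x[k] - lr * grad(k, x[k])
                t += 1
            avg = np.mean(x, axis=0)         # synchronize: average all models
            x = [avg.copy() for _ in range(N)]
        lr *= 0.5                            # decrease the learning rate ...
        period *= 2                          # ... and increase the communication period

    xbar = np.mean(x, axis=0)
    print("final loss:", np.mean([np.mean((A[k] @ xbar - b[k]) ** 2) for k in range(N)]))

The key property this sketch mirrors from the abstract is that communication becomes progressively rarer as the learning rate shrinks, which is what drives the reduced communication complexity; the specific factor of 2 is an arbitrary choice for illustration.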
Keywords/Search Tags: Large Scale Machine Learning, Distributed Parallel Optimization, Stochastic Optimization, Computation and Communication Decoupling, Communication Complexity