
Communication Efficient Distributed Parallel Stochastic Optimization Algorithms

Posted on: 2021-03-15    Degree: Master    Type: Thesis
Country: China    Candidate: S H Shen    Full Text: PDF
GTID: 2428330602999093    Subject: Computer application technology
Abstract/Summary:
With the increase in the amount of data and the expansion of model scale, stochastic gradient descent (SGD) and its distributed parallel variants have become the workhorses of large-scale machine learning. Although distributed SGD algorithms can achieve a linear iteration speedup, in practice they are limited significantly by communication cost, which makes a linear time speedup difficult to achieve. Designing communication-efficient distributed parallel algorithms is therefore an important problem in machine learning research. In this paper, we propose algorithms that reduce the communication cost from two different points of view.

For training large-scale deep learning models, we propose the Computation and Communication Decoupled Stochastic Gradient Descent (CoCoD-SGD) algorithm, which runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD achieves a linear iteration speedup with respect to the total computational capability of the hardware in both homogeneous and heterogeneous environments. In addition, it has lower communication complexity and better time speedup than traditional distributed SGD algorithms. Specifically, when N devices collaboratively run T iterations, the communication complexity of CoCoD-SGD is O(N^{3/4} T^{3/4}), matching the state-of-the-art Local SGD, while CoCoD-SGD achieves a much better time speedup because its computation and communication overlap. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD.

From the perspective of communication complexity, previous studies have shown that the communication complexity of Local SGD with a fixed or an adaptive communication period is O(N^{3/2} T^{1/2}) when the data distributions on the clients are identical (IID) and O(N^{3/4} T^{3/4}) otherwise (Non-IID). In this paper, we propose STagewise Local SGD (STL-SGD), which gradually increases the communication period as the learning rate decreases. We prove that STL-SGD retains the same convergence rate and linear speedup as mini-batch SGD. Moreover, owing to the increasing communication period, when the objective is convex or satisfies the Polyak-Łojasiewicz condition, the communication complexity of STL-SGD is O(N log T) in the IID case and O(N^{1/2} T^{1/2}) in the Non-IID case, a significant improvement over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
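To make the decoupling idea behind CoCoD-SGD concrete, the following is a minimal single-process sketch in the spirit of the algorithm described above: while an averaging step over the current models is "in flight" (simulated here by a sleeping worker thread), each worker keeps taking local SGD steps, and the progress made during communication is added back onto the averaged model. The toy least-squares data, the helper names (grad, simulated_all_reduce), and all hyperparameters are illustrative assumptions, not the thesis's implementation.

    # Toy simulation of computation/communication decoupling (CoCoD-SGD spirit).
    import time
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    rng = np.random.default_rng(0)
    N, dim, rounds, local_steps, lr = 4, 10, 50, 5, 0.05
    A = [rng.standard_normal((32, dim)) for _ in range(N)]            # per-worker data
    b = [a @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(32) for a in A]

    def grad(k, x):
        # stochastic gradient of a least-squares objective on worker k
        i = rng.integers(0, 32, size=8)
        return A[k][i].T @ (A[k][i] @ x - b[k][i]) / len(i)

    def simulated_all_reduce(models):
        time.sleep(0.01)                    # stands in for network latency
        return np.mean(models, axis=0)      # average of the models that were sent

    x = [np.zeros(dim) for _ in range(N)]   # per-worker model copies
    with ThreadPoolExecutor(max_workers=1) as comm:
        for _ in range(rounds):
            # launch averaging of the current models without blocking ...
            future = comm.submit(simulated_all_reduce, [xi.copy() for xi in x])
            # ... while each worker keeps computing local SGD updates
            deltas = []
            for k in range(N):
                d = np.zeros(dim)
                for _ in range(local_steps):
                    d -= lr * grad(k, x[k] + d)
                deltas.append(d)
            avg = future.result()           # wait for the averaged model
            # combine: averaged model plus the progress made during communication
            x = [avg + deltas[k] for k in range(N)]

    xbar = np.mean(x, axis=0)
    print("final loss:", np.mean([np.mean((A[k] @ xbar - b[k]) ** 2) for k in range(N)]))

In a real distributed deployment, the sleeping thread would be replaced by a non-blocking collective (for example, an asynchronous all-reduce), and each worker would run its local steps on its own machine rather than in the serial loop used in this simulation.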
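Similarly, the stagewise schedule of STL-SGD can be illustrated with a short sketch: workers run local SGD between synchronizations, and at the end of each stage the learning rate is decreased while the communication period is increased. The halving/doubling factors, stage lengths, and toy objective below are my own illustrative assumptions, not the schedule proved in the thesis.

    # Toy simulation of a stagewise local-SGD schedule (STL-SGD spirit).
    import numpy as np

    rng = np.random.default_rng(1)
    N, dim = 4, 10
    A = [rng.standard_normal((32, dim)) for _ in range(N)]
    b = [a @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(32) for a in A]

    def grad(k, x):
        i = rng.integers(0, 32, size=8)
        return A[k][i].T @ (A[k][i] @ x - b[k][i]) / len(i)

    x = [np.zeros(dim) for _ in range(N)]
    lr, period, stages, iters_per_stage = 0.1, 1, 6, 200

    for s in range(stages):
        t = 0
        while t < iters_per_stage:
            for _ in range(period):          # local SGD steps between syncs
                for k in range(N):
                    x[k] = x[k] - lr * grad(k, x[k])
                t += 1
            avg = np.mean(x, axis=0)         # synchronize: average all models
            x = [avg.copy() for _ in range(N)]
        lr *= 0.5                            # decrease the learning rate ...
        period *= 2                          # ... and increase the communication period

    xbar = np.mean(x, axis=0)
    print("final loss:", np.mean([np.mean((A[k] @ xbar - b[k]) ** 2) for k in range(N)]))

The key property this sketch mirrors from the abstract is that communication becomes progressively rarer as the learning rate shrinks, which is what drives the reduced communication complexity; the specific factor of 2 is an arbitrary choice for illustration.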
Keywords/Search Tags: Large Scale Machine Learning, Distributed Parallel Optimization, Stochastic Optimization, Computation and Communication Decoupling, Communication Complexity