
Research On Distributed Optimization Methods For Large-Scale Machine Learning

Posted on: 2022-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: X F Liang
Full Text: PDF
GTID: 2518306323962479
Subject: Computer application technology

Abstract/Summary:
For large-scale machine learning problems, stochastic gradient descent (SGD) is a fundamental tool. However, as data and model scale grow, training machine learning models, especially deep learning models, has become increasingly time-consuming. Designing distributed algorithms that accelerate training is therefore an important topic in machine learning research.

Distributed stochastic gradient descent, which employs multiple workers in parallel, has been widely adopted for training machine learning models. In practice, however, the communication cost of model averaging prevents such distributed training from achieving a linear time speedup. Local-update algorithms, including Local SGD and FedAvg, have recently attracted much attention because of their appealing properties, such as low communication cost and privacy preservation. Yet these algorithms suffer a significant degradation in convergence rate when the data distributions on workers are non-identical. This thesis improves their convergence rate under heterogeneous data and further reduces the communication cost of distributed optimization.

Previous studies show that the communication complexity of Local SGD is of order O(T^{1/2}) when the data distributions on workers are identical and O(T^{3/4}) when they are non-identical. To keep local-update algorithms from converging slowly on heterogeneous data, we propose Variance Reduced Local SGD (VRL-SGD), which is more robust to high data variance among workers. We theoretically prove that VRL-SGD achieves a linear iteration speedup with the lower communication complexity O(T^{1/2}), compared to O(T^{3/4}) for Local SGD. For practical scenarios with high data variance, we also present an effective warm-up mechanism, VRL-SGD-W, to eliminate the impact of high data variance among workers. Experimental results demonstrate that VRL-SGD performs considerably better than Local SGD on heterogeneous data, and that VRL-SGD-W is much more robust under high data variance.

Algorithms based on Local SGD mostly adopt a fixed communication period and step size. A period or step size that adapts over the course of the iterations, however, often performs better and reaches the optimal solution faster. To further reduce the communication complexity, this thesis builds on VRL-SGD and proposes STagewise Variance Reduced Local SGD (ST-VRL-SGD), which adaptively increases the communication period and decreases the step size. It is proved that the communication complexity is reduced to O(log(T)) for strongly convex functions, whether or not the data distributions on workers are identical. Experimental results show the superiority of the ST-VRL-SGD algorithm for strongly convex functions.
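The abstract names VRL-SGD but does not spell out its update rule, so the sketch below only illustrates the general idea it builds on: Local SGD with periodic model averaging, where each worker additionally applies a control-variate-style correction (in the spirit of SCAFFOLD) so that heterogeneous local gradients do not pull the local models apart between communication rounds. The synthetic heterogeneous least-squares setup, all variable names, and the specific correction rule are assumptions for illustration, not the thesis algorithm.

```python
# Illustrative sketch only: Local SGD with periodic averaging plus a
# SCAFFOLD-style control variate per worker. The actual VRL-SGD update is
# not given in this abstract; the correction rule below is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, local_steps, rounds, lr = 4, 10, 20, 50, 0.005

# Heterogeneous data: worker k holds its own least-squares problem.
A = [rng.normal(loc=0.5 * k, scale=1.0, size=(100, dim)) for k in range(n_workers)]
b = [rng.normal(size=100) for _ in range(n_workers)]

def stoch_grad(k, x, batch=10):
    """Stochastic gradient of 0.5 * ||A_k x - b_k||^2 / n on a mini-batch."""
    idx = rng.integers(0, 100, size=batch)
    Ak, bk = A[k][idx], b[k][idx]
    return Ak.T @ (Ak @ x - bk) / batch

x_global = np.zeros(dim)
c = [np.zeros(dim) for _ in range(n_workers)]   # per-worker control variates
c_bar = np.zeros(dim)                           # their average

for r in range(rounds):
    local_models, c_new = [], []
    for k in range(n_workers):
        x = x_global.copy()
        for _ in range(local_steps):
            # Corrected local step: the (c_bar - c[k]) term counteracts the
            # drift caused by worker k's non-identical data distribution.
            x -= lr * (stoch_grad(k, x) - c[k] + c_bar)
        local_models.append(x)
        # Refresh the control variate from this round's observed local drift.
        c_new.append(c[k] - c_bar + (x_global - x) / (lr * local_steps))
    x_global = np.mean(local_models, axis=0)    # communication: model averaging
    c, c_bar = c_new, np.mean(c_new, axis=0)

print("final global objective:",
      sum(0.5 * np.mean((A[k] @ x_global - b[k]) ** 2) for k in range(n_workers)))
```

The communication pattern here (average only every `local_steps` iterations) is exactly the trade-off the thesis targets: fewer averaging rounds cut communication cost, but heterogeneous workers then drift apart, which the per-worker correction term is meant to suppress.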
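ST-VRL-SGD is described as adaptively increasing the communication period and decreasing the step size, but the abstract gives no concrete schedule. The snippet below shows one plausible stagewise scheme, doubling the period and halving the step size per stage, purely to illustrate why such schedules need only O(log(T)) communication rounds over T iterations.

```python
# Hypothetical stagewise schedule: the communication period grows and the
# step size shrinks across stages. The doubling/halving constants here are
# illustrative assumptions, not taken from the thesis.
base_period, base_lr, total_iters = 4, 0.1, 10_000

period, lr, t, rounds = base_period, base_lr, 0, 0
while t < total_iters:
    t += period            # run `period` local steps per worker, then communicate
    rounds += 1
    period *= 2            # longer local phases in later stages
    lr /= 2                # smaller steps in later stages
print(f"{total_iters} iterations used only {rounds} communication rounds")
```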
Keywords/Search Tags: Distributed Optimization, Machine Learning, Federated Learning, Variance Reduction, Communication Complexity