In recent years, machine learning has been widely used in many fields, such as face recognition, speech recognition, and autonomous vehicles. These applications cannot be realized without the support of large-scale data sets and large-scale machine learning models. However, the computing and storage resources of a single machine are limited: training on a single machine may take a long time, and sometimes a single machine cannot even store the whole training data set. Distributed machine learning, which uses multiple machines to complete the training task, has therefore attracted a lot of attention. Stochastic gradient descent is a commonly used optimization algorithm in machine learning and is also widely used in distributed machine learning. However, applying stochastic gradient descent on multiple machines raises several challenges: (1) In heterogeneous networks, the performance of each machine is different. If the machines cooperate synchronously, the overall performance is limited by the slowest worker, which leads to low efficiency. If they cooperate asynchronously, stale gradients make the convergence process unstable, which may lower the accuracy of the final trained model. (2) Frequent communication between machines brings a lot of overhead, and model averaging can reduce the communication overhead. Allowing each machine to perform a different number of local iterations further eliminates the need for the machines to wait for each other, but using the same weight for all workers lets the poor local models at slow workers slow down the convergence of the global model.

In view of the above problems, this thesis studies how to improve the convergence performance of distributed stochastic gradient descent in distributed machine learning: while ensuring the final accuracy of the model, it also improves the convergence speed of the algorithm with respect to wall-clock time. The main work of this thesis is as follows:

1. Group stochastic gradient descent is proposed. In this algorithm, workers with the same or similar performance are put into the same group; workers in the same group work synchronously, while different groups update the model at the parameter server asynchronously. The proposed method mitigates the straggler problem, since workers in the same group spend little time waiting for each other, and its staleness is small, since the number of groups is much smaller than the number of workers. The convergence of the method is proved through theoretical analysis. Simulation results show that the method converges faster than SSGD and ASGD on a heterogeneous cluster and that the accuracy of the final trained model is higher. (A minimal sketch of the grouping and group-synchronous update steps is given below.)

2. Weighted parallel restarted stochastic gradient descent in heterogeneous networks is proposed. In this algorithm, the parameter server performs a weighted average based on the number of local iterations of each worker, so workers with more local iterations are given more weight in the model averaging. Since slower workers have less weight when the models are averaged, they do not seriously slow down the convergence of the global model. Theoretical analysis shows that this method converges and gives the optimal number of local iterations. Simulation results show that the proposed method converges faster than traditional methods and that the accuracy of the final trained model is higher. (A sketch of the weighted averaging step is also given below.)
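The core mechanism of the first method is to place workers of similar speed in the same group and run an ordinary synchronous SGD step within each group, while groups push their updates to the parameter server independently. The Python sketch below illustrates only the grouping and the in-group synchronous step, assuming per-worker speed estimates and gradients are already available; the function names and the sort-then-split grouping rule are illustrative assumptions, not the exact algorithm of the thesis.

```python
import numpy as np

def group_workers(speeds, num_groups):
    """Group workers of similar speed: sort by speed, then split into contiguous chunks.
    (Illustrative grouping rule; the thesis may use a different criterion.)"""
    order = np.argsort(speeds)
    return [list(chunk) for chunk in np.array_split(order, num_groups)]

def group_update(params, worker_grads, group, lr):
    """Synchronous step inside one group: average the group's gradients and apply them.
    Different groups would call this independently (asynchronously) at the server."""
    avg_grad = np.mean([worker_grads[i] for i in group], axis=0)
    return params - lr * avg_grad

# Example: 6 heterogeneous workers split into 2 groups of similar speed
speeds = [5.1, 0.9, 4.8, 1.1, 2.0, 2.2]          # e.g. iterations per second
groups = group_workers(speeds, 2)
params = np.zeros(3)
grads = [np.random.randn(3) for _ in speeds]     # placeholder gradients
params = group_update(params, grads, groups[0], lr=0.1)
print(groups, params)
```

Because each group only waits for workers of comparable speed, the synchronization cost inside a group stays small, while the number of asynchronous update streams equals the (small) number of groups rather than the number of workers.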
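For the second method, the key operation at the parameter server is a model average whose weights are proportional to each worker's local iteration count. Below is a minimal sketch of that averaging step, assuming each worker reports its parameter vector and its number of completed local iterations; `weighted_average` is a hypothetical name, not the thesis implementation.

```python
import numpy as np

def weighted_average(local_models, local_iters):
    """Average worker models, weighting each by its local iteration count.

    local_models: list of 1-D numpy arrays (each worker's parameters)
    local_iters:  list of ints, local SGD steps completed by each worker
    """
    weights = np.asarray(local_iters, dtype=float)
    weights /= weights.sum()                     # normalize so weights sum to 1
    stacked = np.stack(local_models)             # shape: (num_workers, dim)
    return (weights[:, None] * stacked).sum(axis=0)

# Example: three workers with unequal local progress; the slow worker
# (only 2 local steps) contributes little to the global model.
models = [np.array([1.0, 2.0]), np.array([1.5, 2.5]), np.array([0.5, 1.0])]
iters  = [10, 8, 2]
print(weighted_average(models, iters))
```

Weighting by local iteration count is what prevents a straggler's under-trained local model from dominating the averaged global model, which is the effect the thesis analyzes.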