
Dynamic Communication Optimization Technology for Distributed Machine Learning

Posted on: 2019-09-29  Degree: Master  Type: Thesis
Country: China  Candidate: H D Tu  Full Text: PDF
GTID: 2428330548476381  Subject: Computer technology
Abstract/Summary:
In the era of big data, distributed machine learning has become a hot topic in artificial intelligence because it can cope with the complexity of big data, achieve higher prediction accuracy, and support more intelligent tasks. At present, most distributed machine learning systems are built on the parameter server architecture: the parameter server updates the global model parameters and communicates with each compute node, while the compute nodes train the machine learning model and do not communicate with one another. The parameter communication strategy between the parameter server and the compute nodes is therefore the key factor affecting both the model accuracy and the training time of distributed machine learning.

The earliest and most common parameter communication strategy is the bulk synchronous parallel strategy, in which every compute node enters a synchronization barrier at each iteration and waits for the other nodes to finish. The bulk synchronous parallel strategy cannot guarantee cluster load balancing, wastes much of the cluster's computing capacity, and makes distributed training slow. To address these problems, industry and academia have proposed several parameter communication optimization strategies. The asynchronous parallel strategy removes the synchronization barrier: after computing its local gradients, each compute node fetches the global model parameters from the parameter server on its own and immediately begins the next iteration. The asynchronous parallel strategy maximizes the cluster's computing capacity, but it allows the global model parameters to become stale and inconsistent, so the accuracy of the final model cannot be guaranteed. The stale synchronous parallel strategy relaxes the synchronization barrier by allowing a bounded number of iterations of staleness, which uses the cluster's computing capacity more effectively; however, it does not fully account for the cluster environment and cannot adapt to real working clusters, which again leads to accuracy problems.
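To make the differences between these strategies concrete, the following minimal sketch (written for this summary, not taken from the thesis or Paracas) shows a staleness-bounded pull on a toy parameter server: with the bound set to 0 it behaves like bulk synchronous parallel, and with an unbounded value it behaves like asynchronous parallel. All names here (ParameterServer, worker_loop, STALENESS) are illustrative assumptions.

```python
# Hypothetical sketch of a staleness-bounded parameter-server training loop.
# Illustrative only; not the implementation described in the thesis.
import threading
import numpy as np

STALENESS = 2        # fastest worker may lead the slowest by at most 2 iterations
NUM_WORKERS = 4
NUM_ITERATIONS = 20

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.clock = [0] * NUM_WORKERS          # per-worker iteration counter
        self.lock = threading.Condition()

    def push(self, grad, worker_id):
        """Apply a worker's gradient and advance that worker's clock."""
        with self.lock:
            self.params -= 0.01 * grad          # SGD step with a fixed learning rate
            self.clock[worker_id] += 1
            self.lock.notify_all()

    def pull(self, worker_id):
        """Return parameters, blocking while this worker is too far ahead.
        STALENESS = 0 degenerates to bulk synchronous parallel;
        an unbounded STALENESS degenerates to asynchronous parallel."""
        with self.lock:
            while self.clock[worker_id] - min(self.clock) > STALENESS:
                self.lock.wait()
            return self.params.copy()

def worker_loop(server, worker_id, target):
    for _ in range(NUM_ITERATIONS):
        params = server.pull(worker_id)
        grad = params - target                  # gradient of 0.5*||params - target||^2
        server.push(grad, worker_id)

if __name__ == "__main__":
    server = ParameterServer(dim=8)
    targets = [np.random.randn(8) for _ in range(NUM_WORKERS)]
    threads = [threading.Thread(target=worker_loop, args=(server, i, targets[i]))
               for i in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final parameters:", server.params)
```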
To solve the problems mentioned above, this thesis studies dynamic communication optimization technology for distributed machine learning and proposes two parameter communication optimization strategies: the dynamic synchronous parallel strategy and the adaptive parallel strategy. Based on these two strategies, the thesis implements a distributed machine learning framework called Paracas. The main contributions are as follows:

(1) This thesis analyzes the problems of the stale synchronous parallel strategy and proposes a new dynamic parameter communication optimization strategy, called the dynamic synchronous parallel strategy. It is based on the dynamic finite fault tolerance of iterative-convergent machine learning training, which allows it to overcome the shortcomings of the stale synchronous parallel strategy, make full use of the cluster's computing capacity, improve the accuracy of the trained model, and speed up training. After designing and implementing the strategy, the thesis analyzes it theoretically and shows that it converges correctly and guarantees the correctness of the trained model.

(2) This thesis extends and refines the dynamic synchronous parallel strategy and proposes an adaptive parameter communication optimization strategy, called the adaptive parallel strategy. A performance monitoring model is implemented to adjust the number of iterations of each compute node more accurately (a conceptual sketch follows this list). Based on this model, the adaptive parallel strategy maximizes cluster performance, accelerates distributed machine learning training, and preserves the accuracy of the trained model.

(3) Because the open-source machine learning framework Caffe only supports single-node training, this thesis implements a parameter-server-based distributed machine learning framework, called Paracas, that supports both the dynamic synchronous parallel strategy and the adaptive parallel strategy. Both strategies are evaluated on Paracas, and the experiments show that the proposed parameter communication optimization strategies achieve good performance and scalability.
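As an illustration of the idea behind the performance monitoring model in contribution (2), the sketch below assigns larger per-round iteration budgets to faster workers based on measured iteration times. It is a hypothetical example written for this summary; the names (PerformanceMonitor, iteration_budget) and the simple scaling rule are assumptions, not the model implemented in Paracas.

```python
# Hypothetical sketch of adjusting per-worker iteration budgets from runtime
# measurements. The monitoring model and constants are illustrative assumptions.
from collections import deque

class PerformanceMonitor:
    def __init__(self, num_workers, window=5, base_iterations=10):
        self.base_iterations = base_iterations
        # Sliding window of recent per-iteration times for each worker.
        self.history = [deque(maxlen=window) for _ in range(num_workers)]

    def record(self, worker_id, iteration_seconds):
        self.history[worker_id].append(iteration_seconds)

    def _avg(self, worker_id):
        h = self.history[worker_id]
        return sum(h) / len(h) if h else None

    def iteration_budget(self, worker_id):
        """Give faster workers proportionally more local iterations per
        synchronization round, so stragglers do not stall the barrier."""
        averages = [self._avg(i) for i in range(len(self.history))]
        known = [a for a in averages if a is not None]
        if not known or averages[worker_id] is None:
            return self.base_iterations
        slowest = max(known)
        # Scale the budget by how much faster this worker is than the slowest one.
        scale = slowest / averages[worker_id]
        return max(1, int(round(self.base_iterations * scale)))

if __name__ == "__main__":
    monitor = PerformanceMonitor(num_workers=3)
    # Simulated measurements: worker 2 is roughly twice as slow as worker 0.
    samples = {0: 0.10, 1: 0.15, 2: 0.20}
    for _ in range(5):
        for wid, t in samples.items():
            monitor.record(wid, t)
    for wid in range(3):
        print(f"worker {wid}: {monitor.iteration_budget(wid)} iterations per round")
```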
Keywords/Search Tags:Distributed Machine Learning, Parameter Server, Communication Optimization, Dynamic Synchronous Parallel Strategy, Adaptive Parallel Strategy