
Communication Optimization Technology For Distributed Machine Learning Framework

Posted on: 2021-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: R Yang
Full Text: PDF
GTID: 2428330605481161
Subject: Computer technology
Abstract/Summary:
With the advent of the big data era, the scale and complexity of data are increasing dramatically, and training machine learning models on a single node with traditional methods faces severe challenges. To meet the needs of large-scale machine learning algorithms, training models on distributed machine learning systems has become mainstream. At present, most distributed machine learning systems follow the parameter server design, which divides the system into two parts: parameter servers and computing nodes. A parameter communication consistency model is used to trade off computation against communication. This thesis focuses on how to accelerate the training of distributed machine learning models.

Traditional distributed training usually follows the Bulk Synchronous Parallel (BSP) model, which strictly separates the computation phase from the communication phase. Every global model update must wait for all computing nodes to reach the synchronization barrier, so the training time is bounded by the slowest computing node. When cluster nodes differ in performance, the BSP model suffers from serious synchronization waiting. To address this problem, the Asynchronous Parallel (ASP) model and the Stale Synchronous Parallel (SSP) model exploit the fault tolerance of iterative-convergent algorithms: each computing node uploads its computed gradients to the parameter server asynchronously. However, because local model replicas lag behind the global model, these two models introduce a new problem: gradient over-delay.

In view of the above problems, this thesis conducts in-depth research on communication optimization techniques for distributed machine learning frameworks and proposes two novel parameter communication consistency models: the Limited Synchronous Parallel (LSP) model and the Adaptive Limited Synchronous Parallel (ALSP) model. Based on these two consistency models, the distributed machine learning framework Kudus is implemented. The main research work of this thesis includes the following aspects:

(1) This thesis analyzes in depth the advantages and disadvantages of existing parameter communication consistency models, and exploits the fault tolerance of iterative-convergent algorithms to implement a performance-driven communication model: the Limited Synchronous Parallel model. The LSP model alleviates both the synchronization waiting problem and the gradient over-delay problem by relaxing the synchronization barrier: it implements a limited synchronization barrier that, in each synchronization phase, allows the computing nodes that reach the barrier first to synchronize immediately. The model also exhibits limited asynchronous parallelism, which accelerates distributed training. This thesis designs and implements the LSP model and analyzes its convergence theoretically.

(2) This thesis builds on the LSP model to implement a further performance-driven communication model: the Adaptive Limited Synchronous Parallel model. For application scenarios with dynamically changing performance and resource-constrained system configurations, the model adapts, again using the fault tolerance of iterative-convergent algorithms, to solve the gradient over-delay problem. The ALSP model divides synchronization into group synchronization and global synchronization, and a performance monitoring system collects performance metrics from each computing node in real time to inform the model's decisions. The ALSP model effectively reduces the synchronization waiting time of computing nodes within a group and also reduces model staleness between groups, thereby balancing the load across cluster nodes and making the model better suited to real production clusters.

(3) Following the parameter server design, this thesis implements the distributed machine learning framework Kudus. Finally, the two consistency models are experimentally verified on Kudus. The results show that the proposed LSP and ALSP models effectively improve the training efficiency of distributed machine learning models.
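The staleness bound that the SSP model places on asynchronous training can be illustrated with a short sketch. This is not code from the thesis or from Kudus; the function name and the per-worker iteration clocks are illustrative assumptions.

```python
# Hypothetical sketch of the Stale Synchronous Parallel (SSP) rule:
# a worker may start its next iteration without waiting only if it is
# no more than `staleness` iterations ahead of the slowest worker.
# This bounds the gradient over-delay that pure ASP allows.

def ssp_can_proceed(worker_clocks, worker_id, staleness):
    """Return True if `worker_id` may start its next iteration."""
    my_clock = worker_clocks[worker_id]
    slowest = min(worker_clocks)
    # The worker blocks once it runs more than `staleness` iterations
    # ahead of the slowest worker; otherwise it continues asynchronously.
    return my_clock - slowest <= staleness

clocks = [3, 5, 4, 6]  # current iteration of each of 4 workers
print(ssp_can_proceed(clocks, 3, staleness=2))  # fastest worker must wait
print(ssp_can_proceed(clocks, 1, staleness=2))  # within the bound, proceeds
```

With `staleness=0` this degenerates to BSP (everyone waits at each barrier), while an unbounded staleness recovers ASP, which is why SSP sits between the two.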
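The limited synchronization barrier of the LSP model described in contribution (1) can be sketched as follows. This is a hypothetical, single-threaded illustration, not the Kudus implementation: in each synchronization phase, the first `k` of `n` computing nodes to reach the barrier synchronize immediately, while the remaining nodes continue and join a later phase.

```python
# Illustrative sketch (not from the thesis) of a limited synchronization
# barrier: a phase is released as soon as `k` of the `n` workers arrive,
# instead of waiting for all `n` as BSP would.

class LimitedBarrier:
    def __init__(self, n_workers, k):
        assert 1 <= k <= n_workers
        self.n = n_workers
        self.k = k
        self.arrived = []  # worker ids waiting in the current phase

    def arrive(self, worker_id):
        """Register an arrival. Returns (released, worker_ids): once the
        k-th worker arrives, the phase is released for those workers and
        the barrier resets for the next phase."""
        self.arrived.append(worker_id)
        if len(self.arrived) >= self.k:
            released = list(self.arrived)
            self.arrived = []  # start the next synchronization phase
            return True, released
        return False, None

barrier = LimitedBarrier(n_workers=4, k=3)
print(barrier.arrive(0))  # not enough arrivals yet
print(barrier.arrive(1))
print(barrier.arrive(2))  # third arrival releases the phase
```

Because stragglers are not waited for, fast nodes avoid the BSP synchronization wait; because each phase still synchronizes a quorum, the asynchrony (and hence gradient delay) stays limited rather than unbounded as in ASP.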
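One way the ALSP model of contribution (2) might use the performance monitor's real-time metrics is to place computing nodes with similar throughput into the same group, so that intra-group synchronization waiting stays short while slower groups lag only up to the global synchronization point. The greedy grouping heuristic and the tolerance value below are illustrative assumptions, not the thesis's actual algorithm.

```python
# Hypothetical sketch: group workers by measured iteration throughput
# (iterations/sec, as a performance monitor might report) so that each
# group contains nodes of similar speed for group synchronization.

def group_by_throughput(throughputs, tolerance=0.25):
    """Greedily group worker ids whose throughput is within a relative
    `tolerance` of the fastest member of their group."""
    order = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    groups, current, fastest = [], [], None
    for i in order:
        if not current:
            current, fastest = [i], throughputs[i]
        elif (fastest - throughputs[i]) / fastest <= tolerance:
            current.append(i)  # close enough to the group's fastest node
        else:
            groups.append(current)  # start a new, slower group
            current, fastest = [i], throughputs[i]
    groups.append(current)
    return groups

# Workers 1 and 3 are fast, workers 0 and 2 are slow:
print(group_by_throughput([10.0, 40.0, 12.0, 38.0]))  # -> [[1, 3], [2, 0]]
```

Re-running the grouping as the monitor reports new measurements is one plausible way to realize the "adaptive" aspect: when a node's performance changes, it migrates to a group of similar speed instead of stalling its old group.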
Keywords/Search Tags:Distributed Machine Learning, Parameter Server, Communication Optimization, Limited Synchronous Parallel Model, Adaptive Limited Synchronous Parallel Model