
Research And Application Of Communication Efficient Asynchronous Distributed ADMM Algorithm

Posted on: 2021-01-29  Degree: Master  Type: Thesis
Country: China  Candidate: J. Y. Xie  Full Text: PDF
GTID: 2428330614456833  Subject: Software engineering
Abstract/Summary:
Today, machine learning has become a basic means of extracting structured information and knowledge from raw data. Two trends dominate its research and application: larger-scale training data and more complex models. The computing power and storage capacity of a single commodity computer cannot support the training of such large models, and this challenge has spurred research on distributed machine learning.

Training a machine learning model requires solving an optimization problem. The ADMM (Alternating Direction Method of Multipliers) algorithm is an effective method for such problems: a large global problem is decomposed into multiple smaller, easier-to-solve local subproblems, and the solution of the global problem is obtained by coordinating the solutions of the subproblems. With the help of ADMM, the computational load of model training can easily be distributed across multiple computing nodes, so the algorithm provides a theoretical basis for distributed machine learning.

Scaling ADMM to large clusters requires studying distributed computing models and the communication patterns between nodes. An important factor limiting the scalability of ADMM is the large communication overhead caused by model synchronization; in the master-slave communication mode, this overhead grows linearly with the cluster size. This thesis focuses on reducing communication overhead. The specific research contents include:

1. According to the characteristics of the ADMM algorithm, computation and communication are separated, and an asynchronously triggered Allreduce operation is implemented. The asynchronous distributed ADMM algorithm is combined with the Allreduce communication mode, and experiments verify the efficiency and scalability of the asynchronous distributed ADMM algorithm in this mode.

2. A communication interface, Sparse Allreduce, is designed and implemented for sparse data communication, aiming to provide efficient communication support for a wider range of distributed optimization algorithms. Sparse Allreduce provides an asynchronous Allreduce operation, and this thesis redesigns the implementation of the Allreduce operation for sparse data communication.

3. Based on Sparse Allreduce and the asynchronous distributed ADMM algorithm, a model training framework, ADMMDML, is designed and implemented. Through incremental communication and message filtering strategies, ADMMDML takes full advantage of the characteristics of Sparse Allreduce, and with mixed-precision training it can further reduce communication overhead. Experiments verify that ADMMDML can further optimize communication.

Experiments were performed on the Shanghai University Ziqiang 4000 cluster system. The results show that the ADMM algorithm in the Allreduce communication mode reduces network latency by 80% compared to the ADMM algorithm in the master-slave mode, and that mixed-precision training and message filtering further reduce network latency by 36% on top of the Allreduce communication mode.
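To make the decomposition concrete, the consensus form of ADMM described above can be sketched as follows. This is a minimal serial simulation of a least-squares problem split across workers, not the thesis's implementation; the function name, the choice of problem, and the penalty parameter rho are illustrative assumptions. The averaging of the local variables marked in the comments is exactly the step that a distributed implementation would realize with an Allreduce operation.

```python
import numpy as np

def consensus_admm_lstsq(A_blocks, b_blocks, rho=1.0, iters=500):
    """Consensus ADMM for min (1/2)||A x - b||^2, with rows of (A, b)
    partitioned across simulated workers (one block per worker)."""
    n = A_blocks[0].shape[1]          # model dimension
    N = len(A_blocks)                 # number of (simulated) workers
    z = np.zeros(n)                   # global consensus variable
    x = [np.zeros(n) for _ in range(N)]  # local primal variables
    u = [np.zeros(n) for _ in range(N)]  # scaled dual variables
    # Each worker's x-update solves a local regularized normal equation:
    #   (A_i^T A_i + rho I) x_i = A_i^T b_i + rho (z - u_i)
    lhs = [Ai.T @ Ai + rho * np.eye(n) for Ai in A_blocks]
    rhs0 = [Ai.T @ bi for Ai, bi in zip(A_blocks, b_blocks)]
    for _ in range(iters):
        for i in range(N):            # local subproblems, parallel in a cluster
            x[i] = np.linalg.solve(lhs[i], rhs0[i] + rho * (z - u[i]))
        # z-update: average x_i + u_i across all workers.
        # In a distributed setting this mean is computed with Allreduce,
        # which is the communication step the thesis optimizes.
        z = np.mean([x[i] + u[i] for i in range(N)], axis=0)
        for i in range(N):            # local dual updates, no communication
            u[i] += x[i] - z
    return z
```

Because only the averaged vector crosses the network, the per-iteration communication cost is one Allreduce of a length-n vector per worker, independent of the local data size.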
Keywords/Search Tags: Distributed Computing, Asynchronous ADMM, Communication Optimization, Programming Interface