With the rapid development of artificial intelligence and the slowdown of Moore's law, a single machine or accelerator can no longer meet the growing memory and compute requirements of deep learning training tasks. Distributed deep learning alleviates the resource limitations of a single machine by running training tasks in parallel across multiple machines or accelerators. However, as deep learning models grow, the communication bottleneck caused by gradient synchronization becomes increasingly severe and degrades distributed training performance. Researchers in both industry and academia have focused on this bottleneck and proposed many optimization strategies based on the AllReduce or Parameter Server (PS) architectures. These works, however, fail to observe that gradient sparsity increases along with model size. Current collective communication libraries provide no native support for sparse data, so a large amount of zero-valued data is transmitted during gradient aggregation, wasting network bandwidth. Gradient compression algorithms can reduce the communication volume without affecting model accuracy, which improves training performance, but the overhead of compression limits its application; moreover, gradient compression harms the accuracy and convergence of training when combined with large-batch optimization. The emergence of programmable network devices provides new opportunities for improving the performance of distributed machine learning: in-network aggregation can increase throughput, reduce latency, and shorten distributed training time. However, the limited processing capability and on-chip memory of programmable network devices require in-network aggregation algorithms to have low time and space complexity. Designing high-performance in-network aggregation for sparse data under such resource constraints is therefore both challenging and meaningful.

Focusing on the above problems and challenges, this paper investigates sparse collective communication, in-network aggregation, and gradient compression for improving distributed training performance. The contributions of this paper are as follows:

(1) As models grow, gradients become increasingly sparse, yet most existing collective communication libraries have no native support for sparse data. This paper proposes an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth utilization by sending only non-zero data blocks. We demonstrate that the proposed algorithm effectively improves the performance of distributed training.

(2) Communication granularity affects not only the utilization of network bandwidth but also the flexibility of flow control. This paper proposes a block-fusion-based sparse data aggregation algorithm for data in both dense and sparse formats. The algorithm achieves fine-grained transmission control while making full use of the network bandwidth through the fused transmission of multiple data blocks and dedicated data mapping methods. Experimental results show that the algorithm effectively improves the stability of aggregation performance; a minimal sketch of the non-zero-block idea shared by contributions (1) and (2) is given below.
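The following is a minimal, illustrative sketch of block-wise sparse aggregation, assuming a fixed block granularity; the names (BLOCK_SIZE, compress_blocks, aggregate_blocks) are hypothetical and do not correspond to the proposed system's actual API. Each worker sends only (index, payload) pairs for its non-zero blocks, and the receiver sums whatever blocks arrive.

```python
# Hypothetical sketch of non-zero-block aggregation, not the paper's code.
import numpy as np

BLOCK_SIZE = 256  # assumed fixed block granularity

def compress_blocks(grad, block_size=BLOCK_SIZE):
    """Split a flat gradient into blocks and keep only the non-zero ones."""
    pad = (-len(grad)) % block_size
    g = np.concatenate([grad, np.zeros(pad, grad.dtype)])
    blocks = g.reshape(-1, block_size)
    nonzero = np.flatnonzero(np.any(blocks != 0, axis=1))
    # Only (index, payload) pairs for non-zero blocks are "sent".
    return nonzero, blocks[nonzero]

def aggregate_blocks(parts, num_blocks, block_size=BLOCK_SIZE):
    """Sum the sparse block sets received from all workers."""
    out = np.zeros((num_blocks, block_size))
    for indices, payload in parts:
        out[indices] += payload
    return out.ravel()

# Two workers whose gradients are sparse in different regions.
g1 = np.zeros(1000); g1[10:20] = 1.0
g2 = np.zeros(1000); g2[500:510] = 2.0
parts = [compress_blocks(g) for g in (g1, g2)]
num_blocks = -(-1000 // BLOCK_SIZE)           # ceil(1000 / 256) = 4
agg = aggregate_blocks(parts, num_blocks)[:1000]
assert np.allclose(agg, g1 + g2)              # only 2 of the 8 per-worker blocks were sent
```

Note that the block size trades off the two concerns in contribution (2): smaller blocks expose more sparsity and finer flow control, while larger (or fused) blocks amortize per-message overhead and use bandwidth more fully.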
(3) The design of in-network aggregation algorithms is constrained by the limited on-chip resources of programmable network devices. To address this problem, this paper proposes an in-network aggregation algorithm for sparse data that reduces the resource demand on the data plane of programmable network devices through a multi-level co-design. We implement the system on a Tofino programmable switch, and the experimental results demonstrate the effectiveness of in-network aggregation for sparse data. A simplified model of slot-based switch aggregation is sketched after contribution (4).

(4) The performance of gradient compression is always limited by its own overhead; in addition, when used with large-batch optimization algorithms, it hurts training accuracy and convergence. Focusing on these problems, this paper first proposes a block-based gradient compression method that effectively reduces compression overhead by building on the sparse data aggregation algorithms proposed above. It then proposes a gradient compression algorithm based on a scaling function and a mask operation, which resolves the poor accuracy and convergence of compression algorithms under large-batch training optimization. Experimental results show that the proposed gradient compression strategy effectively reduces the communication volume without influencing the accuracy of the model.
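To illustrate the on-chip memory constraint discussed in contribution (3), the sketch below models slot-based aggregation in plain Python. It is a host-side model under assumed parameters (NUM_SLOTS, NUM_WORKERS), not P4 code for the Tofino data plane, and it omits the sparse-data handling and multi-level co-design of the proposed algorithm.

```python
# Host-side model of slot-based in-network aggregation; not P4/Tofino code.
import numpy as np

NUM_SLOTS = 8      # tiny on-chip aggregator pool (assumed size)
NUM_WORKERS = 4    # workers contributing to every chunk

class SwitchModel:
    """Switch data plane modeled as a small fixed pool of aggregation slots."""

    def __init__(self, values_per_packet):
        self.acc = np.zeros((NUM_SLOTS, values_per_packet))
        self.seen = np.zeros(NUM_SLOTS, dtype=int)

    def on_packet(self, chunk_id, payload):
        # Chunks are mapped onto the small slot pool; workers must be
        # paced so that two in-flight chunks never share a slot.
        slot = chunk_id % NUM_SLOTS
        self.acc[slot] += payload
        self.seen[slot] += 1
        if self.seen[slot] == NUM_WORKERS:
            result = self.acc[slot].copy()
            self.acc[slot] = 0.0    # release the slot for the next chunk
            self.seen[slot] = 0
            return result           # aggregated chunk broadcast back
        return None                 # packet consumed; wait for the rest

# All four workers send chunk 0; the last packet triggers the result.
switch = SwitchModel(values_per_packet=64)
payload = np.ones(64)
results = [switch.on_packet(0, payload) for _ in range(NUM_WORKERS)]
assert results[-1] is not None and np.allclose(results[-1], 4 * payload)
```

Because the slot pool is tiny relative to the gradient, throughput depends on how quickly slots are filled, aggregated, and released, which is why the data-plane algorithm must have low time and space complexity.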
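Finally, the sketch below is a generic scale-then-mask compressor, an assumption-laden illustration of the two ingredients named in contribution (4) rather than the paper's exact algorithm: a mask operation keeps only the largest-magnitude entries, and a scaling function restores the l2 norm so that large-batch optimizers do not see systematically shrunken updates.

```python
# Generic scale-and-mask sparsifier; illustrative only, not the paper's algorithm.
import numpy as np

def compress(grad, keep_ratio=0.01, eps=1e-12):
    """Mask out all but the k largest-magnitude entries, then rescale."""
    k = max(1, int(len(grad) * keep_ratio))
    # Mask operation: indices of the k largest-magnitude components.
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    # Scaling function: restore the l2 norm lost to masking, so the
    # effective step size is not silently shrunk in large-batch runs.
    scale = np.linalg.norm(grad) / (np.linalg.norm(grad[idx]) + eps)
    return idx, grad[idx] * scale

def decompress(idx, values, size):
    out = np.zeros(size)
    out[idx] = values
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
idx, vals = compress(g, keep_ratio=0.01)       # ~99% of entries dropped
g_hat = decompress(idx, vals, g.size)
assert np.isclose(np.linalg.norm(g_hat), np.linalg.norm(g))
```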