
Use RDMA To Accelerate The Distributed Deep Learning

Posted on: 2020-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2428330623463629
Subject: Computer technology
Abstract/Summary:
Deeper models and larger datasets are two major ingredients for applying deep learning (DL) to real-world problems, which inevitably shifts model training from a single GPU card to GPU clusters due to limited GPU memory and time-to-solution requirements. High-speed, low-latency RDMA-capable network fabrics such as InfiniBand and RoCE play an important role in coping with the enormous amount of data exchanged during training. DL frameworks are built upon these fabrics through various APIs, including IPoIB, MPI, and RDMA Verbs. Trade-offs between performance and usability arise when adapting DL frameworks to RDMA-capable networks, and improper design choices may yield high-performance yet hard-to-maintain and hard-to-merge code. This paper presents our approach to adapting MXNet, a modular and versatile DL framework, to RDMA-capable networks. Dividing the training process in MXNet into P2P communication and AllReduce communication, we add incremental optimizations to its message-passing code. Experiments show that our approach exhibits near-linear speedups, with parallel efficiency reaching 96% compared to 53% for the original IPoIB version when scaling to 100 GPU cards. In contrast to other MPI-based porting approaches, our modifications are confined to MXNet's Parameter Server module and are transparent to upper-layer operations, thus sacrificing no features such as auto recovery and flexible consistency.
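The abstract splits training communication into P2P and AllReduce phases. As an illustration only (not the thesis's actual RDMA Verbs implementation inside MXNet's Parameter Server), the sketch below simulates ring AllReduce, the communication pattern commonly mapped onto RDMA fabrics: a reduce-scatter phase followed by an all-gather phase, for 2*(n-1) steps in total. All names here are hypothetical.

```python
def ring_allreduce(worker_data):
    """Sum-reduce equal-length vectors across n simulated workers on a ring."""
    n = len(worker_data)
    length = len(worker_data[0])
    chunk = length // n
    data = [list(v) for v in worker_data]  # copy: leave caller buffers untouched

    def seg(idx):
        # Index range of segment idx; the last segment absorbs any remainder.
        start = idx * chunk
        return start, (length if idx == n - 1 else start + chunk)

    # Reduce-scatter: at step t, worker i forwards segment (i - t) mod n to its
    # right neighbour, which accumulates it. Sends are snapshotted first to
    # mimic all workers transmitting simultaneously.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            s, e = seg((i - step) % n)
            sends.append(((i + 1) % n, s, data[i][s:e]))
        for dst, s, payload in sends:
            for off, v in enumerate(payload):
                data[dst][s + off] += v

    # All-gather: worker i now owns the fully reduced segment (i + 1) mod n;
    # circulate the reduced segments so every worker ends with the full sum.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            s, e = seg((i + 1 - step) % n)
            sends.append(((i + 1) % n, s, e, data[i][s:e]))
        for dst, s, e, payload in sends:
            data[dst][s:e] = payload

    return data
```

On a real RDMA fabric, each entry in `sends` would correspond to a one-sided RDMA write into the neighbour's pre-registered buffer, which is where the latency advantage over IPoIB comes from.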
Keywords/Search Tags: RDMA, Deep Learning, Network