
Research and Implementation of Performance Optimization of the Distributed Machine Learning System MXNet under InfiniBand Network Architecture

Posted on: 2019-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: B C Lv
Full Text: PDF
GTID: 2428330611493141
Subject: Computer Science and Technology
Abstract/Summary:
Big data brings both opportunities and challenges to machine learning. On the one hand, big data allows machine learning to train more complex and accurate models that mine the deep value of the data; on the other hand, such powerful models generate parameters on the order of 10^9 to 10^12. The iterative nature of learning algorithms makes it necessary to transmit these parameters frequently between distributed nodes, so the network becomes the performance bottleneck of distributed machine learning systems. Compared with traditional Ethernet, InfiniBand, a high-speed interconnect commonly used in high-performance computing, and its RDMA technology offer high bandwidth, low latency, and low CPU load. However, advanced distributed machine learning systems such as MXNet are not yet able to take advantage of this technology. In response to these problems, the main work of this paper is as follows:

First, this paper tests and analyzes the InfiniBand communication mechanisms to determine two optimized transfer strategies for MXNet. Based on an analysis of the InfiniBand architecture, the channel semantics and memory semantics that InfiniBand provides are benchmarked: the performance of the SEND/RECEIVE, RDMA WRITE, and RDMA READ operations is measured across different transport modes and message sizes. The evaluation shows that SEND/RECEIVE in UC mode is more suitable for transferring small data, while RDMA WRITE in UC mode and RDMA READ (which requires RC mode) are more suitable for transferring large data. Accordingly, two optimized transfer strategies are chosen for MXNet: the SEND/RECEIVE + RDMA READ strategy and the SEND/RECEIVE + RDMA WRITE strategy.
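As an illustration of the difference between the two verb families that these benchmarks compare, the following minimal sketch (not the thesis code; the queue pair qp, the registered memory region mr, and the peer's remote_addr/rkey are assumed to have been exchanged during connection setup) posts a SEND and an RDMA WRITE work request with libibverbs:

    #include <infiniband/verbs.h>

    // Channel semantics: the receiver must have posted a matching RECEIVE.
    int post_send(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len) {
        ibv_sge sge = {};
        sge.addr   = reinterpret_cast<uint64_t>(buf);
        sge.length = len;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode     = IBV_WR_SEND;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;   // request a completion entry
        return ibv_post_send(qp, &wr, &bad);
    }

    // Memory semantics: writes directly into the peer's registered memory
    // without involving the remote CPU.
    int post_write(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey) {
        ibv_sge sge = {};
        sge.addr   = reinterpret_cast<uint64_t>(buf);
        sge.length = len;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;  // peer buffer, learned out of band
        wr.wr.rdma.rkey        = rkey;         // peer MR key, learned out of band
        return ibv_post_send(qp, &wr, &bad);
    }

The SEND path consumes a RECEIVE posted by the peer, while the WRITE path bypasses the remote CPU entirely, which is what makes the memory-semantics verbs attractive for large transfers.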
Second, the two optimized transfer strategies are designed and implemented in detail, replacing the MXNet system's own transfer module with one that supports RDMA. The main source of communication overhead in MXNet is the exchange of model parameters between the worker nodes and the server nodes of the ps-lite architecture during iterative computation, so RDMA transfer is enabled by analyzing and reworking the system's parameter-passing process. The experimental results show that, for the push and pull transfer operations, ps-lite improved with either optimization strategy is 2 to 3 times faster than the unimproved ps-lite, and that the improved MXNet runs concrete machine learning applications 1 to 3 times faster than the unmodified MXNet. Comparing the performance differences between the two strategies, and the reasons for them, provides a reference for choosing between them for applications with different parameter sizes.

Third, by testing and analyzing the time overhead of RDMA memory management, this paper designs and implements an efficient memory management mechanism for MXNet messages. The overhead of the RDMA memory registration and deregistration operations is measured for different memory sizes, and the results show that these two operations are no less expensive than the transfer operation itself (a sketch of such a timing measurement appears below). Therefore, a method that manages the memory of small and large messages separately is proposed: the memory used by small messages is registered and deregistered only once and is reused via memcpy, while large-message memory is registered anew each time a message is ready for transfer; the memory threshold that separates small from large messages is determined through experiments. Finally, we design and implement a small-message memory pool that supports multithreading, based on the fast_pool_allocator provided by the Boost library. It realizes memory reuse, greatly reduces the frequency of registration and deregistration operations, and eases contention among threads by maintaining multiple small-message memory pools.
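A minimal sketch of the kind of registration-cost microbenchmark described above (assuming an already-opened protection domain pd; names and error handling are illustrative only):

    #include <infiniband/verbs.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    // Time ibv_reg_mr/ibv_dereg_mr pairs for a given buffer size.
    void time_reg_dereg(ibv_pd* pd, size_t bytes, int iters) {
        void* buf = std::malloc(bytes);           // error checks elided
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            ibv_mr* mr = ibv_reg_mr(pd, buf, bytes,
                                    IBV_ACCESS_LOCAL_WRITE |
                                    IBV_ACCESS_REMOTE_READ |
                                    IBV_ACCESS_REMOTE_WRITE);
            ibv_dereg_mr(mr);                     // pinning/unpinning dominates
        }
        auto t1 = std::chrono::steady_clock::now();
        double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
        std::printf("%zu bytes: %.2f us per reg+dereg\n", bytes, us);
        std::free(buf);
    }

Sweeping this over a range of sizes is one way to locate the small/large threshold mentioned above.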
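And a minimal sketch of a small-message pool built on boost::fast_pool_allocator (all names here are hypothetical, not the thesis implementation; the thesis's sharding across multiple pools and the one-time HCA registration of the pooled memory are elided, and the allocator's default mutex already makes it safe to call from multiple threads):

    #include <boost/pool/pool_alloc.hpp>
    #include <cstddef>
    #include <cstring>

    constexpr std::size_t kSmallMsgLimit = 4096;  // hypothetical threshold

    struct SmallMsgBuf {            // fixed-size slot for one small message
        char data[kSmallMsgLimit];
    };

    using SmallMsgAllocator = boost::fast_pool_allocator<SmallMsgBuf>;

    class SmallMsgPool {
    public:
        // Copy the payload into a pooled slot instead of registering
        // fresh memory with the HCA for every message.
        SmallMsgBuf* acquire(const void* payload, std::size_t len) {
            SmallMsgBuf* slot = alloc_.allocate(1);
            std::memcpy(slot->data, payload, len);
            return slot;
        }
        // Return the slot to the pool for reuse.
        void release(SmallMsgBuf* slot) { alloc_.deallocate(slot, 1); }
    private:
        SmallMsgAllocator alloc_;
    };

Messages above the threshold would bypass the pool and be registered per transfer, matching the split strategy described above.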
Keywords/Search Tags: InfiniBand, RDMA, MXNet, Distributed machine learning system