
Research and Implementation of Performance Optimization of the Distributed Machine Learning System MXNet under InfiniBand Network Architecture

Posted on: 2019-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: B C Lv
Full Text: PDF
GTID: 2428330611493141
Subject: Computer Science and Technology
Abstract/Summary:
Big data brings both opportunities and challenges to machine learning. On the one hand, big data allows machine learning to train more complex and accurate models that mine the deep value of the data; on the other hand, such powerful models generate parameters on the order of 10^9 to 10^12. The iterative nature of learning algorithms makes it necessary to transmit these parameters frequently between distributed nodes, so the network becomes the performance bottleneck of distributed machine learning systems. Compared with traditional Ethernet, InfiniBand, a high-speed interconnect commonly used in high-performance computing, and its RDMA technology offer high bandwidth, low latency, and low CPU load. However, advanced distributed machine learning systems such as MXNet are not yet able to take advantage of this technology. In response to these problems, the main work of this paper is as follows:

First, this paper tests and analyzes the InfiniBand communication mechanisms to determine two optimized transfer strategies for MXNet. Based on an analysis of the InfiniBand architecture, the channel semantics and memory semantics that InfiniBand provides are benchmarked: the performance of the SEND/RECEIVE, RDMA WRITE, and RDMA READ operations is measured across different transport modes and message sizes. The evaluation shows that SEND/RECEIVE in UC mode is more suitable for transferring small data, while RDMA WRITE in UC mode and RDMA READ (which requires RC mode) are more suitable for transferring large data. Accordingly, two optimized transfer strategies are chosen for MXNet: the SEND/RECEIVE + RDMA READ strategy and the SEND/RECEIVE + RDMA WRITE strategy.
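As an illustration of the difference between the two verb families that these benchmarks compare, the following minimal sketch (not the thesis code; the queue pair qp, the registered memory region mr, and the peer's remote_addr/rkey are assumed to have been exchanged during connection setup) posts a SEND and an RDMA WRITE work request with libibverbs:

    #include <infiniband/verbs.h>

    // Channel semantics: the receiver must have posted a matching RECEIVE.
    int post_send(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len) {
        ibv_sge sge = {};
        sge.addr   = reinterpret_cast<uint64_t>(buf);
        sge.length = len;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode     = IBV_WR_SEND;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;   // request a completion entry
        return ibv_post_send(qp, &wr, &bad);
    }

    // Memory semantics: writes directly into the peer's registered memory
    // without involving the remote CPU.
    int post_write(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey) {
        ibv_sge sge = {};
        sge.addr   = reinterpret_cast<uint64_t>(buf);
        sge.length = len;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr = {}, *bad = nullptr;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;  // peer buffer, learned out of band
        wr.wr.rdma.rkey        = rkey;         // peer MR key, learned out of band
        return ibv_post_send(qp, &wr, &bad);
    }

The SEND path consumes a RECEIVE posted by the peer, while the WRITE path bypasses the remote CPU entirely, which is what makes the memory-semantics verbs attractive for large transfers.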
Second, the two optimized transfer strategies are designed and implemented in detail, replacing the MXNet system's own transfer module with one that supports RDMA. The main source of communication overhead in MXNet is the exchange of model parameters between the worker nodes and the server nodes of the ps-lite architecture during iterative computation, so RDMA transfer is enabled by analyzing and reworking the system's parameter-passing process. The experimental results show that, for the push and pull transfer operations, ps-lite improved with either optimization strategy is 2 to 3 times faster than the unimproved ps-lite, and that the improved MXNet runs concrete machine learning applications 1 to 3 times faster than the unmodified MXNet. Comparing the performance differences between the two strategies, and the reasons for them, provides a reference for choosing between them for applications with different parameter sizes.

Third, by testing and analyzing the time overhead of RDMA memory management, this paper designs and implements an efficient memory management mechanism for MXNet messages. The overhead of the RDMA memory registration and deregistration operations is measured for different memory sizes, and the results show that these two operations are no less expensive than the transfer operation itself (a sketch of such a timing measurement appears below). Therefore, a method that manages the memory of small and large messages separately is proposed: the memory used by small messages is registered and deregistered only once and is reused via memcpy, while large-message memory is registered anew each time a message is ready for transfer; the memory threshold that separates small from large messages is determined through experiments. Finally, we design and implement a small-message memory pool that supports multithreading, based on the fast_pool_allocator provided by the Boost library. It realizes memory reuse, greatly reduces the frequency of registration and deregistration operations, and eases contention among threads by maintaining multiple small-message memory pools.
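A minimal sketch of the kind of registration-cost microbenchmark described above (assuming an already-opened protection domain pd; names and error handling are illustrative only):

    #include <infiniband/verbs.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    // Time ibv_reg_mr/ibv_dereg_mr pairs for a given buffer size.
    void time_reg_dereg(ibv_pd* pd, size_t bytes, int iters) {
        void* buf = std::malloc(bytes);           // error checks elided
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            ibv_mr* mr = ibv_reg_mr(pd, buf, bytes,
                                    IBV_ACCESS_LOCAL_WRITE |
                                    IBV_ACCESS_REMOTE_READ |
                                    IBV_ACCESS_REMOTE_WRITE);
            ibv_dereg_mr(mr);                     // pinning/unpinning dominates
        }
        auto t1 = std::chrono::steady_clock::now();
        double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
        std::printf("%zu bytes: %.2f us per reg+dereg\n", bytes, us);
        std::free(buf);
    }

Sweeping this over a range of sizes is one way to locate the small/large threshold mentioned above.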
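And a minimal sketch of a small-message pool built on boost::fast_pool_allocator (all names here are hypothetical, not the thesis implementation; the thesis's sharding across multiple pools and the one-time HCA registration of the pooled memory are elided, and the allocator's default mutex already makes it safe to call from multiple threads):

    #include <boost/pool/pool_alloc.hpp>
    #include <cstddef>
    #include <cstring>

    constexpr std::size_t kSmallMsgLimit = 4096;  // hypothetical threshold

    struct SmallMsgBuf {            // fixed-size slot for one small message
        char data[kSmallMsgLimit];
    };

    using SmallMsgAllocator = boost::fast_pool_allocator<SmallMsgBuf>;

    class SmallMsgPool {
    public:
        // Copy the payload into a pooled slot instead of registering
        // fresh memory with the HCA for every message.
        SmallMsgBuf* acquire(const void* payload, std::size_t len) {
            SmallMsgBuf* slot = alloc_.allocate(1);
            std::memcpy(slot->data, payload, len);
            return slot;
        }
        // Return the slot to the pool for reuse.
        void release(SmallMsgBuf* slot) { alloc_.deallocate(slot, 1); }
    private:
        SmallMsgAllocator alloc_;
    };

Messages above the threshold would bypass the pool and be registered per transfer, matching the split strategy described above.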
Keywords/Search Tags: InfiniBand, RDMA, MXNet, Distributed machine learning system