
Research And Implementation Of Distributed Machine Learning Acceleration Component Based On RDMA Batch Operation

Posted on: 2021-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: R H Zhang
Full Text: PDF
GTID: 2428330611450314
Subject: Computer technology
Abstract/Summary:
Today, machine learning has achieved remarkable success in a wide range of applications. Given ever-growing data sets and increasingly complex training models, a single machine often cannot complete model training within an acceptable time, and Distributed Machine Learning (DML) has become the common practice for reducing time costs. As DML clusters continue to grow, coordinating computation and communication between nodes to realize the full potential of DML systems has become a hotspot in systems research. Addressing this optimization problem in light of the current state of research, this paper proposes an RDMA-based DML acceleration solution. The main work of this paper is as follows:

First, to address the low transmission efficiency of RDMA in large-scale DML systems, this paper presents a DML network acceleration scheme based on RDMA batch operations. The scheme uses carefully designed bitmaps to track messages and reclaim reusable memory. Compared with RDMA communication based on one-by-one confirmation, batch operation further reduces CPU participation in the RDMA path and supports higher concurrency. Because the low-level RDMA interface is poorly encapsulated, this paper packages the scheme as middleware between the DML application and the RDMA communication library; compared with calling the underlying RDMA library directly, application-layer programs can invoke the communication interface simply and conveniently to achieve efficient network communication. Experimental results on typical DML communication models show that this method effectively reduces transmission delay and CPU utilization while further improving system throughput.

Second, to address the straggler problem in the BSP synchronization mode and the low overlap between computation and communication during iteration, this paper proposes a scheme combining dynamic scheduling and network acceleration (DSANA) to improve the iterative efficiency of DML. In the dynamic-scheduling design, the parameter server accurately identifies node attributes and compresses iteration time by transferring part of the computation tasks from straggler nodes to faster nodes. In the network-acceleration design, the transmission accelerator improves the overlap of computation and communication by partitioning large parameter blocks for transmission, further improving the overall communication efficiency of DML. Comparative experiments on four training sets of different sizes show that combining dynamic scheduling with network acceleration effectively improves the iterative efficiency of DML and tolerates the heterogeneous hardware typical of public-cloud scenarios.
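To make the batch-operation idea concrete, the following is a minimal sketch of bitmap-based completion tracking, in the spirit the abstract describes: each in-flight message owns one bit, completions may arrive out of order, and memory is reclaimed in contiguous batches from the tail rather than one message at a time. All names and the ring-buffer layout here are hypothetical illustrations, not the thesis's actual design.

```python
# Illustrative sketch (hypothetical design): a send ring whose slots are
# tracked by a bitmap, reclaimed in contiguous batches instead of per-message.

class BatchSendRing:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.bitmap = 0   # bit i set => slot i's send has completed
        self.tail = 0     # oldest posted message not yet reclaimed
        self.head = 0     # next message sequence number to post

    def post(self) -> int:
        """Claim a slot for an outgoing message; returns the slot index."""
        assert (self.head - self.tail) < self.num_slots, "ring full"
        slot = self.head % self.num_slots
        self.head += 1
        return slot

    def complete(self, slot: int) -> None:
        """Record a completion (completions may land in any order)."""
        self.bitmap |= (1 << slot)

    def reclaim(self) -> int:
        """Free the contiguous run of completed slots at the tail.

        Returns how many slots were freed in this batch. Because whole runs
        are freed at once, the CPU touches bookkeeping far less often than
        with one-by-one acknowledgement."""
        freed = 0
        while self.tail < self.head:
            slot = self.tail % self.num_slots
            if not (self.bitmap & (1 << slot)):
                break
            self.bitmap &= ~(1 << slot)
            self.tail += 1
            freed += 1
        return freed
```

In a real RDMA setting the completions would come from polling the completion queue; the point of the sketch is only the batching of the reclaim step.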
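The dynamic-scheduling idea can likewise be sketched as follows: the parameter server times each worker's iteration, classifies slow nodes as stragglers, and shifts a share of their workload to the fastest node. The thresholds, function names, and policy below are assumptions made for illustration only.

```python
# Hypothetical sketch of straggler-aware rebalancing at the parameter server.

def rebalance(batch_sizes: dict, iter_times: dict,
              straggler_ratio: float = 1.3, move_fraction: float = 0.2) -> dict:
    """Return new per-worker batch sizes after shifting work off stragglers."""
    mean_t = sum(iter_times.values()) / len(iter_times)
    new_sizes = dict(batch_sizes)
    stragglers = [w for w, t in iter_times.items() if t > straggler_ratio * mean_t]
    fastest = min(iter_times, key=iter_times.get)  # fastest worker absorbs work
    for w in stragglers:
        moved = int(new_sizes[w] * move_fraction)
        new_sizes[w] -= moved
        new_sizes[fastest] += moved
    return new_sizes

# Example: worker "w2" is slow, so part of its batch moves to "w0".
sizes = {"w0": 128, "w1": 128, "w2": 128}
times = {"w0": 0.9, "w1": 1.0, "w2": 2.1}
print(rebalance(sizes, times))  # {'w0': 153, 'w1': 128, 'w2': 103}
```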
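Finally, the computation/communication overlap described for DSANA's transmission accelerator can be pictured as cutting gradients into parameter blocks and handing each to a sender as soon as it is ready, so transmission proceeds while later blocks are still being computed. The block size and the send() placeholder below are illustrative, not taken from the thesis.

```python
# Minimal sketch: a sender thread transmits finished parameter blocks while
# the main thread keeps producing the next ones.

import queue, threading

BLOCK_DONE = None  # sentinel marking the end of an iteration's gradients

def send(block):
    print(f"sending block of {len(block)} parameters")  # stand-in for RDMA

def sender(q: queue.Queue):
    while True:
        block = q.get()
        if block is BLOCK_DONE:
            break
        send(block)

def train_iteration(gradients, block_size=4):
    q = queue.Queue()
    t = threading.Thread(target=sender, args=(q,))
    t.start()
    for i in range(0, len(gradients), block_size):
        q.put(gradients[i:i + block_size])  # overlap: compute next while sending
    q.put(BLOCK_DONE)
    t.join()

train_iteration(list(range(10)))
```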
Keywords/Search Tags:Distributed Machine Learning, RDMA, Batch Operations, Dynamic Scheduling, Network Acceleration