
Data Center Network Resource Configuration And Transmission Optimization For Distributed Machine Learning

Posted on: 2021-04-09    Degree: Master    Type: Thesis
Country: China    Candidate: Z K Jiang    Full Text: PDF
GTID: 2428330611498840    Subject: Computer Science and Technology
Abstract/Summary:
To address the problem that machine learning training takes too long as data sets and parameter counts grow ever larger, distributed machine learning (DML) has become one of the important means of accelerating model training. DML requires frequent network communication among multiple hosts during parameter synchronization. However, the Remote Direct Memory Access (RDMA) technology used for DML network transmission does not support the transmission characteristics of DML synchronization well. Starting from this observation, this thesis designs network transmission optimizations for DML.

First, to solve the problem that slow flows, held back by multiple bottlenecks in the network, lag the DML synchronization process, this thesis proposes the Balanced Completion Time Protocol (BCTP). Under BCTP, network nodes record and maintain the transmission status of data flows, rates are allocated according to the network state and the flows' transmission status, the rate allocation is solved with Lyapunov optimization, and the server enforces the allocated rates. On this basis, BCTP-NIC, BCTP-Switch and BCTP-Hybrid are also designed to meet the requirements of deploying BCTP on different kinds of network equipment. Simulation results show that BCTP can reduce the network communication overhead of DML synchronization by up to 20%-45%.

Furthermore, this thesis proposes a Multicast-based scheme to accelerate DML synchronization, which eliminates redundant transmissions in the network during synchronization. The scheme builds a multicast tree according to the principle of least link occupancy, uses the packet arrival rate and the network state as the basis for multicast flow rate allocation, and solves the allocation with the Lagrange multiplier method. Experimental results show that the Multicast scheme achieves a 2-4x communication speedup under All-Reduce synchronization and a speedup of up to the number of nodes under parameter server synchronization.

Second, to solve the problem of low utilization of computing and network resources caused by working nodes executing computation and transmission tasks serially, this thesis proposes the Group Stale Synchronous Parallel (GSSP) model. GSSP uses the computation time and transmission time of DML to set the time-slot length and groups the working nodes according to the slot size. GSSP adopts Bulk Synchronous Parallel within groups and Stale Synchronous Parallel between groups, and reduces network bandwidth competition through a polling strategy across groups. An analysis of the regret bound of GSSP shows that it converges faster when an appropriate group size is chosen. Experimental results show that GSSP achieves higher utilization of computing and network resources.
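As a rough illustration of the GSSP idea described above (not the thesis's actual implementation), the Python sketch below simulates grouped workers that run Bulk Synchronous Parallel inside each group and bounded-staleness (SSP) synchronization across groups, with groups served in a round-robin polling order. The group size, staleness bound, learning rate, and straggler model are all illustrative assumptions.

```python
# Minimal GSSP-style sketch: BSP inside each group, SSP (bounded staleness)
# across groups, groups polled in round-robin order. All constants are
# illustrative assumptions, not values from the thesis.
import numpy as np

def gssp_train(num_workers=8, group_size=4, staleness=2, num_iters=50, dim=10):
    rng = np.random.default_rng(0)
    groups = [list(range(i, i + group_size))
              for i in range(0, num_workers, group_size)]
    params = np.zeros(dim)            # global parameters held by the server
    group_clock = [0] * len(groups)   # per-group iteration counter

    for _ in range(num_iters):
        # Poll groups in round-robin order so they do not compete for bandwidth.
        for g, members in enumerate(groups):
            # SSP condition across groups: a group may run ahead of the slowest
            # group by at most `staleness` iterations; otherwise it waits.
            if group_clock[g] - min(group_clock) > staleness:
                continue
            # Simulate a straggler group that occasionally misses its time slot.
            if rng.random() < 0.2:
                continue
            # BSP inside the group: aggregate all members' gradients at once.
            grads = [rng.normal(size=dim) for _ in members]  # stand-in gradients
            params -= 0.01 * np.mean(grads, axis=0)
            group_clock[g] += 1
    return params, group_clock

if __name__ == "__main__":
    final_params, clocks = gssp_train()
    print("per-group iteration counts:", clocks)
```

In this toy run the per-group clocks can drift apart, but never by more than the staleness bound, which is the property GSSP relies on when trading synchronization delay for resource utilization.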
Keywords/Search Tags: Distributed Machine Learning, Parameter Synchronization, BCTP, Multicast, GSSP