
Data Center Network Resource Configuration And Transmission Optimization For Distributed Machine Learning

Posted on: 2021-04-09    Degree: Master    Type: Thesis
Country: China    Candidate: Z K Jiang    Full Text: PDF
GTID: 2428330611498840    Subject: Computer Science and Technology
Abstract/Summary:
To address the problem that machine learning training takes too long as data sets and parameter counts grow ever larger, distributed machine learning (DML) has become one of the important means of accelerating model training. DML requires frequent network communication among multiple hosts during parameter synchronization. However, the Remote Direct Memory Access (RDMA) technology used for DML network transmission does not support the transmission characteristics of DML synchronization well. Starting from this observation, this thesis designs network transmission optimizations for DML.

First, to solve the problem that slow flows, held back by multiple bottlenecks in the network, lag the DML synchronization process, this thesis proposes the Balanced Completion Time Protocol (BCTP). Under BCTP, network nodes record and maintain the transmission status of data flows, rates are allocated according to the network state and the flows' transmission status, the rate allocation is solved with Lyapunov optimization, and the server enforces the allocated rates. On this basis, BCTP-NIC, BCTP-Switch and BCTP-Hybrid are also designed to meet the requirements of deploying BCTP on different kinds of network equipment. Simulation results show that BCTP can reduce the network communication overhead of DML synchronization by up to 20%-45%.

Furthermore, this thesis proposes a Multicast-based scheme to accelerate DML synchronization, which eliminates redundant transmissions in the network during synchronization. The scheme builds a multicast tree according to the principle of least link occupancy, uses the packet arrival rate and the network state as the basis for multicast flow rate allocation, and solves the allocation with the Lagrange multiplier method. Experimental results show that the Multicast scheme achieves a 2-4x communication speedup under All-Reduce synchronization and a speedup of up to the number of nodes under parameter server synchronization.

Second, to solve the problem of low utilization of computing and network resources caused by working nodes executing computation and transmission tasks serially, this thesis proposes the Group Stale Synchronous Parallel (GSSP) model. GSSP uses the computation time and transmission time of DML to set the time-slot length and groups the working nodes according to the slot size. GSSP adopts Bulk Synchronous Parallel within groups and Stale Synchronous Parallel between groups, and reduces network bandwidth competition through a polling strategy across groups. An analysis of the regret bound of GSSP shows that it converges faster when an appropriate group size is chosen. Experimental results show that GSSP achieves higher utilization of computing and network resources.
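As a rough illustration of the GSSP idea described above (not the thesis's actual implementation), the Python sketch below simulates grouped workers that run Bulk Synchronous Parallel inside each group and bounded-staleness (SSP) synchronization across groups, with groups served in a round-robin polling order. The group size, staleness bound, learning rate, and straggler model are all illustrative assumptions.

```python
# Minimal GSSP-style sketch: BSP inside each group, SSP (bounded staleness)
# across groups, groups polled in round-robin order. All constants are
# illustrative assumptions, not values from the thesis.
import numpy as np

def gssp_train(num_workers=8, group_size=4, staleness=2, num_iters=50, dim=10):
    rng = np.random.default_rng(0)
    groups = [list(range(i, i + group_size))
              for i in range(0, num_workers, group_size)]
    params = np.zeros(dim)            # global parameters held by the server
    group_clock = [0] * len(groups)   # per-group iteration counter

    for _ in range(num_iters):
        # Poll groups in round-robin order so they do not compete for bandwidth.
        for g, members in enumerate(groups):
            # SSP condition across groups: a group may run ahead of the slowest
            # group by at most `staleness` iterations; otherwise it waits.
            if group_clock[g] - min(group_clock) > staleness:
                continue
            # Simulate a straggler group that occasionally misses its time slot.
            if rng.random() < 0.2:
                continue
            # BSP inside the group: aggregate all members' gradients at once.
            grads = [rng.normal(size=dim) for _ in members]  # stand-in gradients
            params -= 0.01 * np.mean(grads, axis=0)
            group_clock[g] += 1
    return params, group_clock

if __name__ == "__main__":
    final_params, clocks = gssp_train()
    print("per-group iteration counts:", clocks)
```

In this toy run the per-group clocks can drift apart, but never by more than the staleness bound, which is the property GSSP relies on when trading synchronization delay for resource utilization.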
Keywords/Search Tags: Distributed Machine Learning, Parameter Synchronization, BCTP, Multicast, GSSP