Research On Network Acceleration Mechanisms For Distributed Machine Learning

Posted on: 2022-06-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 1488306728965339
Subject: Communication and Information System
Abstract/Summary:
To support distributed machine learning (DML) tasks, which are both computation-intensive and communication-intensive, many leading information technology (IT) companies, such as Microsoft, build clusters dedicated to machine learning in their data centers using graphics processing units (GPUs) and other accelerated hardware. Accelerated hardware greatly increases the sample processing speed of each worker node, so more samples are processed per unit of time, which also means the network must transmit more data per unit of time. However, the growth of network bandwidth lags far behind that of hardware computing power, and the communication capability of the network has therefore become the performance bottleneck of DML training in data centers.

In addition, many IT companies operate multiple data centers around the world, each storing the data provided to users as well as the data those users generate. Some DML applications, such as speech recognition and video or image classification, need to analyze the massive data distributed across these data centers to obtain timely and stable machine learning models. Due to privacy, political, legal, and other constraints, however, it is not possible to gather all the data in one data center for training. DML across data centers (geo-distributed machine learning, Geo-DML) therefore generally adopts a layered model synchronization architecture, i.e., it decouples model synchronization within a data center (local model synchronization) from synchronization between data centers (global model synchronization). Compared with local area network (LAN) bandwidth, wide area network (WAN) bandwidth is expensive, scarce, and heterogeneous, and it has become the performance bottleneck of Geo-DML.

This thesis focuses on these two scenarios, intra-data-center and inter-data-center, and accelerates the DML training process from the perspective of network communication. The research contents and major contributions are as follows.

1. The online job scheduling problem for DML in optical circuit switch (OCS) networks is studied. Existing OCS scheduling schemes are not suitable for DML, whose communication and computation stages are iterative and interleaved, so this thesis designs scheduling algorithms for multiple DML training tasks. For intra-task scheduling, the Heaviest Load First (HLF) algorithm is proposed, which prioritizes the flows on the most heavily loaded OCS port. For inter-task scheduling, Shortest Weighted Remaining Time First (SWRTF) is presented: when a training task moves from its communication stage to its computation stage, the available task with the minimum weighted remaining completion time is selected to enter its communication stage, which improves circuit utilization, speeds up data transmission, and reduces the weighted job completion time.
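A minimal sketch of the two rules follows, assuming a simplified model in which each flow occupies one source and one destination OCS port and each task carries a weight and an estimated remaining time. The field names and the remaining_time / weight priority are illustrative assumptions, not the dissertation's implementation.

```python
# Minimal sketch of the HLF and SWRTF rules; field names and formulas are
# illustrative assumptions, not the dissertation's implementation.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Flow:
    src_port: int    # OCS port occupied at the sending side
    dst_port: int    # OCS port occupied at the receiving side
    size: float      # remaining bytes to transfer

@dataclass
class Task:
    weight: float            # importance assigned to the training job
    remaining_time: float    # estimated remaining completion time
    flows: list[Flow] = field(default_factory=list)

def hlf_order(flows: list[Flow]) -> list[Flow]:
    """Heaviest Load First (intra-task): serve first the flows that touch
    the most heavily loaded OCS port, so the bottleneck port drains earliest."""
    load: defaultdict[int, float] = defaultdict(float)
    for f in flows:
        load[f.src_port] += f.size
        load[f.dst_port] += f.size
    return sorted(flows,
                  key=lambda f: max(load[f.src_port], load[f.dst_port]),
                  reverse=True)

def swrtf_pick(ready_tasks: list[Task]) -> Task:
    """Shortest Weighted Remaining Time First (inter-task): when a circuit
    becomes free, grant it to the ready task with the smallest
    remaining_time / weight ratio."""
    return min(ready_tasks, key=lambda t: t.remaining_time / t.weight)
```

Under HLF the bottleneck port starts draining first, which shortens a task's communication stage; remaining_time / weight is the classic weighted shortest-remaining-time priority, used here as one plausible reading of "minimum weighted completion time".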
2. A new network topology suited to the parameter server (PS) architecture is studied. Existing physical topology designs are independent of the communication pattern of the applications above them, and such application-unaware topologies limit the performance gains available to those applications. Based on the communication characteristics of the widely used PS architecture, this thesis proposes PSNet, a reconfigurable and modular network topology that matches the communication requirements of the PS architecture. In addition, the batch completion time of a DML training task is analyzed theoretically on both the PSNet and Fat-Tree topologies. Numerical results and local testbed experiments indicate that running DML training tasks on the PSNet topology significantly speeds up the parameter synchronization process and reduces the per-iteration synchronization time.

3. Adaptive global parameter synchronization for Geo-DML is studied. In traditional global parameter synchronization for Geo-DML, represented by the parameter server architecture, the aggregation nodes are fixed, but a fixed synchronization scheme does not suit heterogeneous and dynamically changing WAN bandwidth. This thesis therefore proposes an adaptive global model synchronization algorithm that accounts for scarce, heterogeneous, and dynamically changing WAN bandwidth and can adaptively change the number, locations, and routes of the aggregation nodes. The performance bound of the algorithm is analyzed theoretically. Simulation and local testbed results show that the adaptive algorithm significantly improves bandwidth utilization and speeds up the global parameter synchronization process.

4. The DML scheduling problem in optical WANs is studied. Although many schemes have been proposed to accelerate Geo-DML training under limited WAN bandwidth, most of them ignore the reconfigurability of the underlying optical WAN. This thesis therefore proposes to combine the network layer and the reconfigurable optical layer to optimize Geo-DML training. Intra-task scheduling is proved to be NP-hard, and a new algorithm based on deterministic rounding is presented that dynamically changes the topology by reconfiguring optical devices and allocates a path and a rate to each synchronization flow; its performance bound is analyzed theoretically. For inter-task scheduling, a multi-task scheduling algorithm is proposed that prioritizes tasks according to their weights and remaining completion times. Simulation results show that network-layer scheduling combined with the reconfigurability of the WAN topology significantly speeds up the training process.
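As a toy illustration of the adaptive idea in contribution 3, the sketch below re-selects, at every global synchronization round, the aggregation data center whose slowest WAN link to the other sites is fastest. The single-aggregator star pattern and the bandwidth map are simplifying assumptions; the dissertation's algorithm additionally adapts the number of aggregation nodes and their routes.

```python
# Toy illustration (an assumption, not the dissertation's algorithm):
# each global synchronization round, re-pick the aggregation data center
# whose slowest WAN link to every other site is fastest.

def pick_aggregator(bandwidth: dict[str, dict[str, float]]) -> str:
    """bandwidth[u][v] is the currently measured WAN bandwidth (Gbps)
    from data center u to data center v."""
    def bottleneck(agg: str) -> float:
        # slowest link on the star connecting agg with every other site,
        # counting both directions (sites push updates and pull the model)
        return min(min(bandwidth[dc][agg], bandwidth[agg][dc])
                   for dc in bandwidth if dc != agg)
    return max(bandwidth, key=bottleneck)

# Example: three sites with asymmetric, time-varying bandwidth
bw = {
    "us":   {"eu": 2.0, "asia": 1.0},
    "eu":   {"us": 2.5, "asia": 0.8},
    "asia": {"us": 1.2, "eu": 0.9},
}
print(pick_aggregator(bw))   # -> "us" for this bandwidth snapshot
```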
Keywords/Search Tags: distributed machine learning (DML), model synchronization, network topology, optical wide area network (optical WAN), training time