
The Research Of Acceleration Of Distributed Machine Learning

Posted on: 2022-05-31    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P Zhou    Full Text: PDF
GTID: 1488306524470834    Subject: Communication and Information System
Abstract/Summary:
Machine learning is a powerful tool for extracting information from big data. However, training a model to a desired accuracy on a large-scale dataset with a single worker node is very time-consuming. It is therefore common to use multiple worker nodes for data-parallel training, in which each node iteratively processes part of the dataset and synchronizes model information with the other nodes over the network in every iteration. Since model sizes range from hundreds of MB to thousands of MB, each node has to send and receive an amount of data equal to the model size in every iteration, which makes distributed machine learning both a communication-intensive and a computation-intensive application.

To support AI-driven services, many large companies such as Google, Microsoft and Amazon have built GPU clusters dedicated to model training. A cluster is typically shared by multiple users to maximize the utilization of the expensive infrastructure, so many distributed training jobs submitted by different users run in the cluster at the same time. Because the available resources in the cluster are fragmented, the worker nodes of a job are usually spread across multiple GPU servers, and a GPU server usually hosts worker nodes belonging to different jobs simultaneously. It is therefore inevitable that jobs compete for the limited network bandwidth. According to Microsoft's statistics, GPU utilization in their clusters is approximately 50%, and the lowest is 34.39%. This means that for more than half of the training time the GPUs are idle, waiting for network communication, which makes communication one of the main bottlenecks of distributed machine learning.

This communication bottleneck arises in two dimensions. (1) The underlying network lacks effective management: there are no effective approaches for scheduling the flows entering and leaving the network, which leads to poor network utilization, and there is no effective congestion control mechanism inside the network, which leads to poor transmission performance. (2) The communication strategies adopted by the training jobs at the application level are not efficient enough, which increases the communication cost of training.
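To make the per-iteration communication cost described above concrete, the following minimal sketch simulates one synchronous data-parallel step: every worker computes a gradient on its own data shard, the gradients are averaged (the effect of an all-reduce), and the traffic each worker must send and receive is estimated. The worker count, model size, and ring all-reduce cost model are illustrative assumptions, not details taken from the thesis.

```python
# Illustrative sketch (not the thesis's system): one synchronous data-parallel
# step for a linear model.  Each simulated worker computes a gradient on its
# own shard, the gradients are averaged, and every replica applies the same
# update.  The ring all-reduce estimate, roughly 2x the model size sent and
# received per worker per iteration, shows why the communication volume grows
# with model size rather than with local batch size.
import numpy as np

NUM_WORKERS = 4
DIM = 10_000                      # assumed number of model parameters
SAMPLES_PER_WORKER = 64
LR = 0.1

rng = np.random.default_rng(0)
w_true = rng.normal(size=DIM)

# Each worker holds a private shard of the dataset.
shards = []
for _ in range(NUM_WORKERS):
    X = rng.normal(size=(SAMPLES_PER_WORKER, DIM))
    y = X @ w_true + 0.01 * rng.normal(size=SAMPLES_PER_WORKER)
    shards.append((X, y))

w = np.zeros(DIM)                 # every worker starts from the same replica

def local_gradient(w, shard):
    """Least-squares gradient on one worker's shard."""
    X, y = shard
    return X.T @ (X @ w - y) / len(y)

for step in range(5):
    grads = [local_gradient(w, shard) for shard in shards]   # parallel compute
    avg_grad = np.mean(grads, axis=0)                         # all-reduce (average)
    w -= LR * avg_grad                                        # identical update on every replica
    loss = np.mean([np.mean((X @ w - y) ** 2) for X, y in shards])
    print(f"step {step}: loss {loss:.4f}")

# Per-iteration traffic of a ring all-reduce: each worker sends and receives
# about 2 * (N-1)/N * model_bytes, independent of the local batch size.
model_bytes = DIM * 4             # float32 parameters
per_worker = 2 * (NUM_WORKERS - 1) / NUM_WORKERS * model_bytes
print(f"~{per_worker / 1e6:.2f} MB sent and received per worker per iteration")
```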
The research of this thesis focuses on the above two dimensions and is divided into the following three parts:

1. The research of flow scheduling at the edge of the network. When scheduling the flows of distributed machine learning jobs, it is necessary to take the requirements of the model development process into account. This process includes operations such as data preprocessing, feature engineering, model design and hyperparameter optimization, and any change to these operations can affect the final quality of the trained model. Developing a high-quality model requires choosing configurations for the above operations, training the model, adjusting the configurations according to the training feedback, and then training again; this cycle repeats until a set of configurations that yields a high-quality model is found. Among these operations, the search for hyperparameter configurations is the most time-consuming and resource-consuming part of the development process. This thesis therefore divides the flow scheduling problem into two sub-problems: first, flow scheduling to accelerate the search of hyperparameters, and second, flow scheduling to accelerate the whole development process. For the hyperparameter search, the thesis finds that a group of cooperative jobs (cojobs) serving one hyperparameter search exhibits clear stage characteristics; a scheduling scheme based on the primal-dual method is designed to minimize the stage completion time of cojobs, and a network scheduling system named Grouper is implemented. For the whole development process, the thesis finds that whether a job is allowed to continue training depends on the performance feedback from its early training stage; it therefore proposes a job-progress-aware flow scheduling approach that minimizes the completion time of early training stages, and implements a network scheduling system named JPAS.

2. The research of congestion control inside the network. The worker nodes of a distributed training job need to exchange both model information and control information. The model information consists of model parameters and gradients, and exchanging it generates a number of large flows. The control information, in contrast, consists of state packets and heartbeat packets used to monitor and control the training state; it is small and requires little bandwidth, but it is sensitive to network latency. Through cluster experiments, the thesis finds that the large flows carrying model information quickly build up long queues at switches, which causes long queuing delays for the control information and thus increases the communication cost of training. Moreover, when congestion becomes more severe, frequent packet loss occurs and control information cannot be delivered before its deadline; the control mechanism of distributed training then misjudges the training state and stops the training. Motivated by these observations, the thesis designs an explicit congestion control scheme that keeps switch queues short: it provides high throughput for model synchronization on the one hand, and low delay and a low packet loss rate for control information on the other, which reduces the communication cost and prevents the training from being stopped.

3. The research of communication-efficient distributed learning algorithms. Due to the dynamic arrival and departure of training jobs, the dynamic allocation of resources, and differences in the underlying network capacity, the link speeds between worker nodes are usually heterogeneous and dynamic. Existing distributed training algorithms usually ignore the actual network conditions and simply assume that the link speeds between nodes are static and homogeneous, which forces nodes in production clusters to communicate frequently over low-speed links and thus results in high communication cost. This thesis proposes a decentralized asynchronous distributed training algorithm in which each node adaptively selects the neighbor nodes it communicates with according to the network conditions, so that the worker nodes train over a communication topology built from high-speed links. The thesis mathematically proves the convergence of the proposed learning algorithm and performs extensive experiments showing that it significantly outperforms state-of-the-art algorithms.
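The following sketch illustrates the general idea of link-speed-aware neighbor selection in decentralized training. It is a simplified toy gossip scheme that runs in synchronous rounds, not the asynchronous algorithm proposed in the thesis; the node count, the number of neighbors K, and the random link-speed model are illustrative assumptions.

```python
# Simplified illustration of link-speed-aware decentralized training (a toy
# gossip scheme, not the thesis's algorithm): every node keeps its own model
# replica, averages it with the K peers reachable over the currently fastest
# links, and then applies a local gradient step on its private shard.
import numpy as np

NUM_NODES = 8
DIM = 100
K = 2                              # assumed number of neighbors per round
LR = 0.05

rng = np.random.default_rng(1)
w_true = rng.normal(size=DIM)

# Each node has a private data shard and its own model replica.
data = []
for _ in range(NUM_NODES):
    X = rng.normal(size=(32, DIM))
    y = X @ w_true
    data.append((X, y))
models = [rng.normal(size=DIM) for _ in range(NUM_NODES)]

def link_speeds():
    """Stand-in for measured pairwise bandwidth; in a real cluster this would
    come from monitoring and would change over time."""
    s = rng.uniform(0.1, 10.0, size=(NUM_NODES, NUM_NODES))
    return (s + s.T) / 2

for step in range(20):
    speeds = link_speeds()
    new_models = []
    for i in range(NUM_NODES):
        # Pick the K peers with the fastest links to node i (excluding itself).
        ranked = [p for p in np.argsort(speeds[i])[::-1] if p != i][:K]
        # Gossip averaging over the selected high-speed neighbors.
        avg = np.mean([models[i]] + [models[p] for p in ranked], axis=0)
        # Local gradient step on node i's shard.
        X, y = data[i]
        grad = X.T @ (X @ avg - y) / len(y)
        new_models.append(avg - LR * grad)
    models = new_models

consensus = np.mean(models, axis=0)
print("distance to w_true:", np.linalg.norm(consensus - w_true))
```

Restricting each round's averaging to the fastest links keeps every exchange off the slow paths, which is the intuition behind the adaptive topology described above.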
Keywords/Search Tags: machine learning, GPU cluster, flow scheduling, congestion control, distributed learning algorithm