
Research On Cluster Resource Scheduling Optimization For Distributed Deep Learning

Posted on: 2021-05-17    Degree: Master    Type: Thesis
Country: China    Candidate: Q P Li    Full Text: PDF
GTID: 2428330647450742    Subject: Computer technology
Abstract/Summary:
With the development of deep learning, neural network models have become increasingly complex, model training takes a long time, and the demand for computing resources is large, so there is an urgent need to use computing resources efficiently to speed up training. General-purpose cluster schedulers have insufficient support for distributed deep learning jobs, resulting in low training efficiency and low cluster resource utilization. Moreover, existing scheduling research for distributed deep learning jobs does not consider the impact of task placement (which node each task executes on) when allocating resources (the number of tasks and the amount of resources per task). In practice, tasks given the same amount of resources run at different speeds on different nodes because of node heterogeneity and inter-job interference, so a scheduler that ignores placement cannot capture its effect on resource allocation.

To address these problems, this paper first builds a model that predicts the training speed of distributed deep learning jobs, exploiting the iterative nature of jobs under the parameter server architecture. The speed model predicts training speed under different job configurations (resource allocation and task placement) and different levels of inter-job interference. Based on the memory characteristics of the parameter server (PS), a general PS memory model is established to predict PS memory consumption. On top of the speed model and the PS memory model, a dynamic resource scheduling method that jointly optimizes resource allocation and task placement is proposed to reduce the average job completion time and improve resource utilization: the speed model guides the generation of job configurations, and the PS memory model guides the allocation of PS memory. Specifically, we make the following contributions.

1. Based on the iterative nature of deep learning jobs, we use historical execution data from the cluster and a deep neural network (DNN) to build a speed model that predicts job training speed, establishing the relationship among training efficiency, job configuration, and inter-job interference (a minimal sketch of such a model appears after this list).

2. We analyze the relationship among the PS memory requirement, the size of the neural network model, and the number of computing nodes (workers) to establish a general PS memory model from historical memory usage, so that the amount of memory allocated to each PS can be adjusted dynamically to improve cluster memory utilization (see the second sketch below).

3. We design a dynamic scheduling method that adjusts job configurations by predicting training speed under candidate configurations and uses the PS memory model to guide PS memory allocation, thereby reducing the average job completion time (see the third sketch below).

4. We implement a customized scheduler prototype based on Kubernetes and evaluate the scheduling method with a job trace whose arrivals follow a Poisson process (see the last sketch below). The results show that, compared with related work, the proposed method shortens the training time of distributed deep learning jobs and improves cluster resource utilization.
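The abstract says only that the speed model is a DNN trained on historical job data; the Python sketch below is illustrative, not the thesis's design. The feature layout (PS/worker counts, per-task resources, placement indicators, interference statistics) and the network shape are assumptions.

```python
# Illustrative only: the feature layout and network shape are assumptions,
# not the architecture described in the thesis.
import torch
import torch.nn as nn

class SpeedModel(nn.Module):
    """MLP mapping a job-configuration feature vector to predicted
    training speed (e.g. mini-batch iterations per second)."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Assumed features: [num_ps, num_workers, cpu/mem per task,
# placement indicators, interference statistics of co-located jobs].
model = SpeedModel(n_features=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, measured_speed: torch.Tensor) -> float:
    """One gradient step on historical (configuration, speed) records."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), measured_speed)
    loss.backward()
    optimizer.step()
    return loss.item()
```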
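For the PS memory model, the abstract states only that PS memory depends on the neural network model size and the number of workers. A minimal sketch, assuming a linear form fitted to historical usage by least squares (the thesis's actual functional form may differ):

```python
import numpy as np

# Assumed linear form: mem ≈ a * model_size + b * num_workers + c.
def fit_ps_memory_model(model_sizes, num_workers, observed_mem):
    """Least-squares fit of the PS memory model to historical usage."""
    X = np.column_stack([model_sizes, num_workers, np.ones(len(observed_mem))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(observed_mem), rcond=None)
    return coef  # (a, b, c)

def predict_ps_memory(coef, model_size, n_workers, headroom=1.1):
    """Predicted PS memory, padded with headroom so under-prediction
    does not cause out-of-memory kills (headroom value is an assumption)."""
    a, b, c = coef
    return headroom * (a * model_size + b * n_workers + c)
```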
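How the scheduler searches the configuration space is not specified in the abstract; this sketch assumes a simple enumerate-and-score step in which the speed model ranks feasible (allocation, placement) candidates by estimated remaining completion time. `predict_speed` and `fits` are stand-ins for the speed model and the cluster state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

@dataclass
class Config:
    """One candidate job configuration: resource allocation plus placement."""
    num_ps: int
    num_workers: int
    placement: Tuple[int, ...]   # node index for each task

def schedule_job(remaining_iterations: float,
                 candidates: Iterable[Config],
                 predict_speed: Callable[[Config], float],
                 fits: Callable[[Config], bool]) -> Optional[Config]:
    """Return the feasible configuration with the smallest estimated
    remaining completion time (remaining iterations / predicted speed)."""
    best, best_eta = None, float("inf")
    for cfg in candidates:
        if not fits(cfg):          # skip placements the cluster cannot host
            continue
        eta = remaining_iterations / predict_speed(cfg)
        if eta < best_eta:
            best, best_eta = cfg, eta
    return best
```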
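Finally, "arrivals follow a Poisson process" is equivalent to exponentially distributed inter-arrival times, which is how such an evaluation trace is typically generated:

```python
import numpy as np

def poisson_arrivals(rate_per_minute: float, n_jobs: int, seed: int = 0):
    """Absolute arrival times (minutes) of a Poisson process: exponential
    inter-arrival times with mean 1/rate, accumulated with cumsum."""
    rng = np.random.default_rng(seed)
    inter_arrival = rng.exponential(1.0 / rate_per_minute, size=n_jobs)
    return np.cumsum(inter_arrival)
```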
Keywords/Search Tags:Resource Scheduling, Distributed Deep Learning, Data Center, Kubernetes