
Research On Resource Scheduling Mechanism Of Distributed Machine Learning Cluster

Posted on: 2021-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: D Wang
GTID: 2428330626455882
Subject: Communication and Information System
Abstract/Summary:
In recent years, more and more companies have deployed distributed machine learning clusters, in which machine learning models are trained to provide various AI-driven services. Machine learning workloads also pose unique challenges for cluster resource scheduling (matching appropriate resources to workloads), such as complex resource-performance relationships and unpredictable job completion times. Most existing research builds white-box models of the resource-performance relationship based on the researchers' understanding of specific machine learning frameworks and workloads, and proposes scheduling heuristics accordingly. Such heuristics depend heavily on the accuracy of the performance model: inaccuracies in the model may lead to heuristics that are far from optimal. Moreover, such a model is not universal. This thesis therefore introduces black-box optimization techniques to solve the resource scheduling problem of distributed machine learning clusters, and makes the following contributions:

1. When the machine learning models in the cluster are production models, users are mainly concerned with the training time of the model. The resource scheduling problem is then chiefly how to select an appropriate physical computing node for each job so as to minimize the total job completion time. This thesis formulates the problem as a formal mathematical model and analyzes its challenges in depth. It proposes a resource scheduling algorithm based on Bayesian optimization, which is applied to this problem for the first time, together with a method of learning the convergence curve to handle the fact that the number of iterations a job needs to reach model convergence is unknown. Four Bayesian optimization algorithms, each a different combination of Bayesian statistical model and acquisition function, are compared experimentally with algorithms from other leading research. The experimental results show that Bayesian optimization can find the optimal or a near-optimal resource configuration with the least search cost, and that the combination of a Gaussian process as the statistical model and EI (Expected Improvement) as the acquisition function performs best. Using a random forest as the statistical model significantly reduces computational complexity, which makes it more practical for very large-scale clusters.

2. When the machine learning models in the cluster are experimental models, users are mainly concerned with the performance (such as accuracy) of the model. This thesis analyzes the key design points of a resource scheduling algorithm for distributed machine learning clusters. Based on this analysis, it proposes Metis, a resource scheduling algorithm based on deep reinforcement learning. Metis aims to maximize the overall performance of the machine learning models in the cluster. It uses a periodic scheduling strategy with fixed time slices, in which a deep reinforcement learning agent makes scheduling decisions to proactively adjust the resource allocation of each workload. The thesis designs the state, action, and reward of the reinforcement learning model. In the state design, the parameters of the loss function curve and of the resource-performance model, both learned on small-scale clusters and small-scale datasets, are used to encode distributed machine learning jobs. Finally, the thesis implements Metis and compares it with leading research. The experimental results show that Metis improves the overall performance of the machine learning models in the cluster and reduces the time users wait during model selection.
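The abstract names EI (Expected Improvement) as the best-performing acquisition function but does not state its formula. As a rough illustration only, the sketch below uses the standard closed-form EI for a minimization objective (here, predicted job completion time) under a Gaussian posterior with mean `mu` and standard deviation `sigma`. The function names, the candidate node names, and the numbers are hypothetical and are not taken from the thesis.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Standard EI for minimization.

    mu, sigma: posterior mean and standard deviation of the objective
               (e.g. predicted completion time) at a candidate configuration.
    best:      lowest objective value observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def pick_next(candidates, best):
    """Return the candidate name with the highest EI.

    candidates: dict mapping a (hypothetical) configuration name
                to its (mu, sigma) posterior estimate.
    """
    return max(
        candidates,
        key=lambda name: expected_improvement(*candidates[name], best),
    )


if __name__ == "__main__":
    # Hypothetical posterior estimates of completion time in seconds.
    candidates = {
        "node-a": (95.0, 1.0),    # slightly better mean, very certain
        "node-b": (105.0, 20.0),  # worse mean, highly uncertain
        "node-c": (100.0, 0.0),   # already evaluated, no uncertainty
    }
    print(pick_next(candidates, best=100.0))
```

In this made-up example the uncertain candidate scores higher than the one with the better predicted mean, which shows how EI trades off exploitation against exploration. In a full Bayesian optimization loop, `mu` and `sigma` would come from the fitted statistical model (a Gaussian process or a random forest in the abstract's comparison).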
Keywords/Search Tags:Distributed machine learning, Resource scheduling, Bayesian optimization, Deep reinforcement learning