
Design And Implementation Of Cluster Scheduler System For Machine Learning Jobs

Posted on: 2020-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: C Gao
Full Text: PDF
GTID: 2428330623463777
Subject: Software engineering major
Abstract/Summary:
With widespread applications in image recognition, language translation, style transfer, natural language processing and other areas, machine learning has been adopted in many industries. Running machine learning workloads on large shared clusters still raises several problems. From the perspective of resources, model training jobs are heterogeneous and sensitive in their resource usage. From the perspective of jobs, model training jobs have more complicated types and priorities than traditional jobs, and a scheduler has to account for remaining runtime, training speed, distributed architecture and other aspects. From the perspective of high-level business, some workloads such as automatic machine learning place heavy demands on resources, which may affect utilization. Today, developers have to maintain the cluster manually. This workflow incurs a higher labor cost and lengthens the end-to-end time of model training jobs. Based on the characteristics of model training jobs, we propose a cluster scheduling system for machine learning workloads, which achieves better utilization of hardware accelerator resources, reduces job completion time and provides multi-dimensional priority-based preemption and degradation. The main contributions of the thesis are as follows:

(1) We introduce a resource scheduling algorithm based on utilization prediction, from the perspective of resources. For common machine learning scenarios, a series of experiments is designed to expose the resource and network bottlenecks of distributed machine learning jobs. To address this problem, a tree-based scheduling algorithm driven by cluster utilization is proposed. Distributed machine learning jobs are scheduled onto small-scale clusters that have better network connectivity. The algorithm avoids network bottlenecks among distributed machine learning jobs, improves the efficiency of hardware accelerator resources and reduces the completion time of training jobs.

(2) We propose a job preemption and resource degradation strategy based on multi-dimensional priority, from the perspective of jobs. To prevent chain preemption, a preempted job is degraded from hardware accelerator resources to CPU rather than evicted, which further improves the training speed of high-priority jobs and keeps the utilization of hardware accelerator resources high.

(3) We design an early-stopping method for automatic machine learning jobs, from the perspective of business. This thesis proposes a resource restriction strategy based on predicted training time and model performance. By predicting the outcome of different parameter choices early, it classifies them and allocates more resources to high-quality parameter combinations. The strategy ensures that automatic machine learning jobs do not overuse cluster resources.

(4) We implement a prototype machine learning platform based on this research. The prototype system covers the end-to-end machine learning workflow and supports data preprocessing, model training, model deployment, benchmarking, automatic machine learning and other related functions. Finally, the prototype system is evaluated through a case study in a computer vision scenario, which demonstrates the effectiveness of the methods proposed in this thesis.
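The placement idea in contribution (1), keeping a distributed job inside a small, well-connected part of the cluster, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract gives no details, and the utilization-prediction component is not modeled here. The `Node` class, the `find_placement` function, the three-level topology (cluster, rack, machine) and the tie-breaking rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical topology tree (cluster, rack, or machine)."""
    name: str
    free_gpus: int = 0  # only meaningful for leaves (machines)
    children: List["Node"] = field(default_factory=list)

    def total_free(self) -> int:
        """Free accelerators in this subtree."""
        if not self.children:
            return self.free_gpus
        return sum(child.total_free() for child in self.children)


def find_placement(node: Node, demand: int, depth: int = 0) -> Optional[Tuple[int, Node]]:
    """Return (depth, node) for the deepest subtree with at least `demand` free GPUs.

    Deeper nodes stand for smaller network domains, so they are preferred.
    Ties at the same depth go to the subtree with the least free capacity,
    which leaves larger subtrees available for larger jobs (an assumed policy).
    Returns None when the subtree cannot hold the job.
    """
    if node.total_free() < demand:
        return None
    best = (depth, node)
    for child in node.children:
        cand = find_placement(child, demand, depth + 1)
        if cand is None:
            continue
        deeper = cand[0] > best[0]
        tighter = cand[0] == best[0] and cand[1].total_free() < best[1].total_free()
        if deeper or tighter:
            best = cand
    return best


def example_cluster() -> Node:
    """A made-up cluster: two racks with two machines each."""
    rack_a = Node("rackA", children=[Node("m1", free_gpus=2), Node("m2", free_gpus=1)])
    rack_b = Node("rackB", children=[Node("m3", free_gpus=4), Node("m4", free_gpus=4)])
    return Node("cluster", children=[rack_a, rack_b])
```

With this made-up topology, a 3-GPU job lands on the single machine `m3` rather than being spread across `rackA`, a 6-GPU job is confined to `rackB`, and only a job that fits in neither rack spans the whole cluster.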
Keywords/Search Tags:Cluster Scheduling, Resource Management, Machine Learning, Priority Scheduling