
Resource Allocation And Scheduling In Big Data Clusters

Posted on: 2022-04-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y X Gao    Full Text: PDF
GTID: 1488306728965079    Subject: Communication and Information System
Abstract/Summary:
In recent years, machine learning methods have been widely used in many fields, including image recognition, machine translation, and speech recognition. Due to growing data volumes and algorithmic complexity, a machine learning algorithm running on a single computation resource can hardly satisfy its huge demand for data processing capability. As a result, modern machine learning algorithms usually run in parallel on a cluster system composed of a large number of computation resources. Even so, despite consuming many expensive computation resources, the time efficiency of machine learning algorithms is still very low: training a deep neural network on dozens of GPUs often takes days or even weeks. Such high consumption of resources and time restricts the wider adoption of machine learning methods. The time efficiency of a machine learning algorithm running on a cluster is closely tied to how the cluster's resources are allocated and scheduled. This dissertation studies how to improve the time efficiency of machine learning algorithms through better allocation and scheduling of cluster resources.

A machine learning algorithm typically runs on a cluster by dividing its computations into multiple computation tasks, which are executed simultaneously on multiple different computation resources. This "divide and conquer" mode of operation brings many challenges to the efficient allocation and scheduling of cluster resources. To address these challenges, this dissertation conducts research in the following four directions.

(1) Efficient, low-delay task scheduling. Task scheduling studies how to assign each task of a machine learning algorithm to computation resources according to the load status of the resources in the cluster, so that the load is well balanced. When assigning tasks, existing cluster schedulers must send a large number of probe messages to inquire about the load status of each computation resource, which causes significant message overhead and scheduling delay. This dissertation designs TASCO, an overhead-free, low-delay cluster scheduler that eliminates probe-message overhead by having each computation resource actively report its load status to the scheduler. In addition, TASCO uses a smallest-task-first scheduling discipline, improving on the first-in-first-out discipline of existing schedulers. We analyze the average task completion time under TASCO using queueing theory, and simulation results show that TASCO significantly reduces the average task completion time compared to existing schedulers.
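The abstract does not give TASCO's internal design, so the following is a minimal Python sketch of its two core ideas under stated assumptions: a centralized scheduler whose resources have already reported their loads (so no probe messages are needed), and pending tasks dispatched smallest first to the least-loaded resource. The task sizes and resource count in the example are hypothetical.

```python
import heapq

def tasco_schedule(task_sizes, num_resources):
    """Sketch of smallest-task-first dispatch over reported loads.

    Resources report their load to the scheduler, so it keeps a local
    min-heap of (load, resource) pairs instead of sending probes.
    Returns a dict mapping task index -> resource index.
    """
    loads = [(0.0, r) for r in range(num_resources)]
    heapq.heapify(loads)
    assignment = {}
    # Smallest-task-first: dispatch short tasks before long ones.
    for i in sorted(range(len(task_sizes)), key=lambda i: task_sizes[i]):
        load, r = heapq.heappop(loads)          # least-loaded resource
        assignment[i] = r
        heapq.heappush(loads, (load + task_sizes[i], r))
    return assignment

# Hypothetical example: four tasks, two computation resources.
print(tasco_schedule([5.0, 1.0, 3.0, 2.0], num_resources=2))
```

The real TASCO is a distributed system analyzed with queueing theory; this toy only illustrates why eliminating probes and serving small tasks first shortens the average task completion time.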
(2) Priority allocation in coflow scheduling. The communication traffic among the computation tasks of a machine learning algorithm constitutes a set of data flows in the cluster network, called a coflow. When multiple machine learning algorithms share a cluster, their coflows often compete for the network's bandwidth resources. The existing solution is to assign a transmission priority to each coflow; because the priority determines the order in which coflows are transmitted, priority allocation has a significant impact on coflow completion times. However, existing priority allocation methods cannot minimize the average coflow completion time. Based on queueing theory, this dissertation analyzes the effect of priority allocation on the average coflow completion time and, building on this analysis, designs a priority allocation algorithm that minimizes it (a stand-in sketch of priority assignment is given below). Simulation results show that the proposed algorithm significantly reduces the average coflow completion time.

(3) Device placement of deep neural networks. The device placement problem studies how to allocate (place) the computation operations of a deep neural network onto computation resources so as to minimize the network's training time. Based on Proximal Policy Optimization (PPO), this dissertation designs a reinforcement learning algorithm, called Spotlight, for finding device placement schemes; Spotlight gradually improves its placement skill through repeated placement trials and training-time feedback (the PPO objective it builds on is sketched below). Experiments on the Google Cloud platform demonstrate that, under the placement scheme found by Spotlight, the training time of the deep neural network is significantly lower than under existing placement schemes, and that Spotlight's learning efficiency is much higher than that of the policy-gradient-based device placement learning algorithm proposed by Google. Furthermore, this dissertation designs an even more efficient device placement learning algorithm, called Post, which integrates cross-entropy minimization with PPO. Experiments on the Google Cloud platform demonstrate that Post's learning efficiency exceeds Spotlight's, and the placement schemes found by Post significantly reduce the training time of several widely used deep neural networks, including Inception-V3 and neural machine translation models.

(4) Packet routing in cluster networks. The packet routing problem studies how to select an appropriate next-hop network node for forwarding a data packet according to the network status, with the goal of minimizing the average packet latency. Existing packet routing algorithms for cluster networks are heuristics and lack performance guarantees. This dissertation proposes a reinforcement-learning-based packet routing algorithm for cluster networks, called MLPI, which learns an optimal routing strategy from simulated routing experiences. MLPI adopts a maximum-likelihood value estimator with the best sample efficiency, overcoming the sample inefficiency of the Monte Carlo and temporal-difference value estimators used in current reinforcement learning methods (the temporal-difference baseline is sketched below). Simulation experiments based on NS-3 show that the learning efficiency of MLPI is much higher than that of existing reinforcement learning algorithms, and that, compared with heuristic routing strategies, the routing strategy learned by MLPI significantly reduces the number of queued packets and the average packet latency.
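For direction (2), the abstract does not specify the queueing-theoretic priority rule, so the sketch below uses a plain smaller-coflow-first ranking mapped onto a fixed number of switch priority levels as a hypothetical stand-in for it; the coflow sizes and the four-level assumption are invented for illustration.

```python
def assign_priorities(coflow_sizes, num_levels):
    """Hypothetical stand-in: rank coflows by total size and map the
    ranks onto a fixed number of priority levels (0 = highest).
    The dissertation's own queueing-theoretic allocation rule is not
    given in the abstract; smaller-coflow-first is used here instead."""
    order = sorted(range(len(coflow_sizes)), key=lambda c: coflow_sizes[c])
    per_level = max(1, -(-len(order) // num_levels))  # ceiling division
    priorities = {}
    for rank, c in enumerate(order):
        priorities[c] = min(rank // per_level, num_levels - 1)
    return priorities

# Hypothetical coflow sizes in MB; switches with 4 priority queues.
print(assign_priorities([120.0, 8.0, 45.0, 300.0], num_levels=4))
```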
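For direction (3), Spotlight's policy architecture is not described in the abstract; the sketch below shows only the standard PPO clipped surrogate objective it builds on, where the reward for a sampled placement would be the negative measured training step time. The probabilities and advantages are hypothetical numbers.

```python
import numpy as np

def ppo_clip_loss(new_probs, old_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized), evaluated on
    sampled placement decisions: one probability per chosen
    (operation, device) pair under the new and old policies."""
    ratio = new_probs / old_probs
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Clipping removes the incentive to move the policy too far per step.
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical values: a placement faster than the running baseline
# step time has a positive advantage.
adv = np.array([0.3, -0.1, 0.5])
print(ppo_clip_loss(np.array([0.4, 0.20, 0.7]),
                    np.array([0.3, 0.25, 0.6]), adv))
```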
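For direction (4), MLPI's maximum-likelihood value estimator is not detailed in the abstract; as a point of reference, the sketch below implements one update of classic tabular Q-routing (Boyan-Littman style), i.e. the temporal-difference estimator the abstract identifies as sample-inefficient. The topology, delays, and learning rate are hypothetical.

```python
import collections

# Hypothetical 4-node line topology: 0 - 1 - 2 - 3.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Q[x][d][y] estimates the delivery time from node x to destination d
# when the packet is forwarded via neighbor y.
Q = collections.defaultdict(lambda: collections.defaultdict(
        lambda: collections.defaultdict(float)))

def q_routing_update(node, dest, q_delay, l_delay, alpha=0.5):
    """One temporal-difference Q-routing step: forward to the neighbor
    with the smallest estimated remaining delay, then nudge the estimate
    toward (queueing delay + link delay + neighbor's best estimate).
    MLPI replaces this style of estimator with a maximum-likelihood one."""
    nxt = min(graph[node], key=lambda y: Q[node][dest][y])
    remote = 0.0 if nxt == dest else min(Q[nxt][dest][y] for y in graph[nxt])
    target = q_delay + l_delay + remote
    Q[node][dest][nxt] += alpha * (target - Q[node][dest][nxt])
    return nxt

# Hypothetical measured delays at node 1 for a packet bound for node 3.
print(q_routing_update(node=1, dest=3, q_delay=0.2, l_delay=1.0))
```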
Keywords/Search Tags: Big data cluster, machine learning, flow scheduling, task scheduling, device placement, packet routing