Research On Mitigating Straggler And Job Scheduling For Parameter Server

Posted on: 2021-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Lu
Full Text: PDF
GTID: 2428330614965996
Subject: Logistics engineering
Abstract/Summary:
The rapid growth in Machine Learning (ML) workloads has led to a dramatic increase in the scale of computation, which makes improving the utilization of cluster resources a difficult problem for current clusters. Within the parameter server framework, this thesis first proposes a resource allocation algorithm based on Deep Reinforcement Learning (DRL) that mitigates the problem of abnormal tasks (stragglers). It then proposes a topology-aware scheduling algorithm that enables efficient communication between GPUs. The main contributions are as follows:

(1) To address the low cluster utilization caused by abnormal tasks, this thesis proposes a parameter server-based architecture for handling them. Specifically, the architecture first takes the highly dynamic state of the cluster into account when resolving abnormal tasks. Then, based on DRL, a flexible help-control synchronization mechanism is proposed to determine the helper node of each node. Finally, an improved Asynchronous Advantage Actor-Critic (A3C) algorithm places a distributed agent at each worker node, and these agents take appropriate actions in a discrete state space to balance the overhead across nodes.

(2) To address the low cluster utilization caused by uneven communication bandwidth between GPUs, this thesis proposes a resource-time model, based on the number of worker nodes and the GPU layout topology, to improve communication efficiency. Based on this model, a topology-aware parameter server algorithm, TOPO-PS, is proposed, which implements a resource placement strategy using a graph mapping algorithm.
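To make the idea of topology-aware placement in contribution (2) concrete, the following is a minimal Python sketch, not the TOPO-PS algorithm itself, whose details the abstract does not give. It assumes a hypothetical symmetric bandwidth matrix between GPUs and simply enumerates candidate GPU groups, preferring the group whose slowest link is fastest. The function name `place_workers`, the `BANDWIDTH` values, and the scoring rule are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical pairwise bandwidth (GB/s) for 4 GPUs: GPUs 0-1 and 2-3 share
# a fast link, while cross-pair traffic uses a slower path. Values are
# illustrative assumptions, not measurements from the thesis.
BANDWIDTH = [
    [0, 50, 10, 10],
    [50, 0, 10, 10],
    [10, 10, 0, 50],
    [10, 10, 50, 0],
]


def place_workers(bandwidth, free_gpus, k):
    """Choose k free GPUs by exhaustive search.

    The score is (bottleneck bandwidth, total bandwidth) over all GPU pairs
    in the group, compared lexicographically. This is an assumed stand-in
    for a graph mapping step, feasible only for small clusters.
    """
    if k < 1 or k > len(free_gpus):
        raise ValueError("k must be between 1 and the number of free GPUs")
    best_group, best_score = None, None
    for group in combinations(sorted(free_gpus), k):
        links = [bandwidth[a][b] for a, b in combinations(group, 2)]
        # A single GPU has no links, so give it a neutral score.
        score = (min(links), sum(links)) if links else (0, 0)
        if best_score is None or score > best_score:
            best_group, best_score = group, score
    return best_group, best_score


if __name__ == "__main__":
    group, score = place_workers(BANDWIDTH, [0, 1, 2, 3], 2)
    print("chosen GPUs:", group, "score:", score)
```

With GPU 0 already occupied, the same call with `free_gpus=[1, 2, 3]` picks the pair that shares the fast link rather than a pair split across the slow path. Exhaustive enumeration grows combinatorially with cluster size, so a real scheduler would need a heuristic or a proper graph mapping method.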
Keywords/Search Tags: Parameter Server, Straggler, Scheduling, Reinforcement Learning