Research On Mitigating Straggler And Job Scheduling For Parameter Server

Posted on:2021-05-29

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Lu

GTID:2428330614965996

Subject:Logistics engineering

Abstract/Summary:

The growing number of machine learning (ML) tasks has led to a dramatic increase in computing scale, so improving the utilization of cluster resources has become a difficult problem facing current networks. Under the parameter server framework, this thesis first proposes a resource allocation algorithm based on deep reinforcement learning (DRL) that alleviates the problem of abnormal tasks (stragglers). It then proposes a topology-aware scheduling algorithm that enables efficient communication between GPUs. The main contributions are as follows:

(1) To address the low cluster utilization caused by abnormal tasks, this thesis proposes a parameter server-based architecture for handling them. Specifically, the highly dynamic state of the cluster is first taken into account in the parameter server architecture in order to resolve abnormal tasks. Then, based on DRL, a flexible help-control synchronization mechanism is proposed to determine the helper node of each node. Finally, an improved Asynchronous Advantage Actor-Critic (A3C) algorithm places distributed agents at each worker node, which take appropriate actions to balance the overhead of each node in a discrete state space.

(2) To address the low cluster utilization caused by uneven communication bandwidth between GPUs, this thesis proposes a resource-time model, based on the number of worker nodes and the GPU layout topology, to improve communication efficiency. Based on this model, a topology-aware algorithm for the parameter server, TOPO-PS, is proposed; it implements the resource placement strategy using a graph mapping algorithm.
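The abstract does not give the details of TOPO-PS, so the following is only a minimal sketch of the general idea behind topology-aware placement, not the thesis's algorithm. It assumes a hypothetical four-GPU machine in which GPUs 0 and 1 share a fast link, GPUs 2 and 3 share a fast link, and all other pairs communicate over a slower path. The bandwidth figures, the names `BANDWIDTH`, `link_bandwidth`, and `place_job`, and the brute-force search (swapped in here for the graph mapping algorithm) are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical pairwise GPU bandwidth in GB/s (illustrative values only).
# Keys are (lower_id, higher_id) pairs; the link is treated as symmetric.
BANDWIDTH = {
    (0, 1): 50, (0, 2): 10, (0, 3): 10,
    (1, 2): 10, (1, 3): 10,
    (2, 3): 50,
}


def link_bandwidth(a, b):
    """Bandwidth of the link between GPU a and GPU b."""
    return BANDWIDTH[(min(a, b), max(a, b))]


def place_job(free_gpus, num_gpus):
    """Pick num_gpus GPUs from free_gpus whose slowest pairwise link is fastest.

    Brute-force search over all subsets; only practical for small machines.
    Ties are broken in favor of the first subset in sorted order.
    """
    if num_gpus < 1 or num_gpus > len(free_gpus):
        raise ValueError("cannot satisfy the requested number of GPUs")
    if num_gpus == 1:
        return (min(free_gpus),)

    def bottleneck(subset):
        # The slowest link in the subset bounds synchronization speed.
        return min(link_bandwidth(a, b) for a, b in combinations(subset, 2))

    return max(combinations(sorted(free_gpus), num_gpus), key=bottleneck)


if __name__ == "__main__":
    print(place_job([0, 1, 2, 3], 2))  # a pair that shares a fast link
    print(place_job([0, 2, 3], 2))     # avoids the slow 0-2 and 0-3 links
```

Scoring a placement by its slowest link reflects the concern with uneven communication bandwidth between GPUs; a real scheduler would also have to weigh the number of worker nodes and job duration, as the resource-time model in the abstract suggests.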

Keywords/Search Tags:

Parameter Server, Straggler, Scheduling, Reinforcement Learning

Related items

1	Effective Straggler Mitigation With Cross-layer Interference-aware Optimization
2	Research On Workshop Scheduling Based On Genetic Algorithm With Reinforcement Learning
3	Research On Cooperative Hybrid Parameter Update For Data Parallel Deep Learning Training Jobs
4	A Research Of Straggler Strategy For Heterogeneous Spark Cluster
5	On Dynamic Scheduling Method Based On Averaged Reinforcement Learning Algorithm
6	Research On Data Center Network Traffic Scheduling Based On Deep Reinforcement Learning
7	Research And Implementation Of Reinforcement Learning Method About Transport Strategy Between Carrier-based Aircraft Station
8	Research On Multipath TCP Scheduling Based On Reinforcement Learning
9	AGV Task Scheduling Method Based On Transfer Reinforcement Learning
10	Dynamic Task Scheduling Algorithm And Platform Based On Reinforcement Learning