The large number of machine learning (ML) tasks has led to a dramatic increase in computing scale, so improving the utilization of cluster resources has become a difficult problem for current networks. Under the parameter server framework, this paper first proposes a resource allocation algorithm based on deep reinforcement learning (DRL) that alleviates the problem of abnormal tasks. It then proposes a topology-aware scheduling algorithm that enables efficient communication between GPUs. The main contributions of this paper are as follows:

(1) To address the low cluster utilization caused by abnormal tasks, this paper proposes a parameter server-based architecture for processing abnormal tasks. Specifically, the parameter server architecture first takes the highly dynamic state of the cluster into account in order to resolve abnormal tasks. Then, based on DRL, a flexible help-control synchronization mechanism is proposed to determine the helper node of each node. Finally, an improved Asynchronous Advantage Actor-Critic (A3C) algorithm places a distributed agent at each worker node, which takes appropriate actions in a discrete state space to balance the overhead of the nodes.

(2) To address the low cluster utilization caused by uneven communication bandwidth between GPUs, this paper proposes a resource-time model, based on the number of worker nodes and the GPU layout topology, to improve communication efficiency. Based on this model, the TOPO-PS algorithm for parameter server topology is proposed, which implements the resource placement strategy using a graph mapping algorithm.
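The intuition behind topology-aware placement in contribution (2) can be illustrated with a minimal sketch. This is not the TOPO-PS algorithm or the resource-time model, which are not specified here. It is a brute-force stand-in that assumes a job's communication cost is dominated by the slowest link among its GPUs. The bandwidth table and the names `BANDWIDTH`, `link_bandwidth`, `bottleneck`, and `place` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical pairwise GPU bandwidth in GB/s (symmetric links).
# Assumption: GPUs 0-2 share a fast interconnect, GPU 3 sits behind a slower one.
BANDWIDTH = {
    (0, 1): 50, (0, 2): 50, (0, 3): 10,
    (1, 2): 50, (1, 3): 10,
    (2, 3): 10,
}


def link_bandwidth(a, b):
    """Return the bandwidth of the link between GPUs a and b."""
    return BANDWIDTH[(min(a, b), max(a, b))]


def bottleneck(gpus):
    """Return the slowest pairwise link among the given GPUs."""
    return min(link_bandwidth(a, b) for a, b in combinations(gpus, 2))


def place(free_gpus, k):
    """Pick k free GPUs whose slowest pairwise link is as fast as possible.

    Exhaustive search over all k-subsets; only practical for small clusters.
    """
    if k < 2 or k > len(free_gpus):
        raise ValueError("k must be between 2 and the number of free GPUs")
    return max(combinations(sorted(free_gpus), k), key=bottleneck)


if __name__ == "__main__":
    chosen = place([0, 1, 2, 3], 3)
    print(chosen, bottleneck(chosen))
```

With the illustrative table above, a three-GPU job is placed on the GPUs that share the fast interconnect, which avoids the slow link to GPU 3. The exhaustive search grows combinatorially with cluster size, which is one reason a practical scheduler would rely on a graph mapping algorithm instead.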