Since the 1980s, multi-agent cooperation algorithms have shown broad application prospects in fields such as video games, resource planning, traffic scheduling, and military operations. In recent years, with the remarkable achievements of Deep Reinforcement Learning (DRL) on many problems, Multi-Agent Reinforcement Learning (MARL) based on DRL has become the main approach to multi-agent cooperation. However, in most MARL methods, agents still explore cooperative policies over the original state-action space. This makes the exploration space of the cooperative policy too large, resulting in low exploration efficiency, so that the policy converges only to a suboptimal solution or fails to converge to a stable policy at all. As the problem scale grows, the exploration space increases exponentially, which further aggravates these problems.

To address the excessively large exploration space of MARL, this paper introduces a hierarchical, goal-conditioned policy structure into MARL and proposes the Goal-conditioned Hierarchical Multi-agent Actor-Critic (GHMAC) algorithm. In GHMAC, the original policy of each agent is decoupled into two layers of sub-policies: the upper policy determines the current goal, and the lower policy interacts with the environment to achieve that goal. Cooperation among agents is realized entirely at the level of the upper policies, compressing the exploration space of the cooperative policy from the original state-action space to a subset of the state space. A cooperative navigation task with various constraints is designed in an open-source multi-agent experimental environment, and a series of experiments is carried out on this task. The results show that both the overall learning curve and the final task completion rate of GHMAC are better than those of representative multi-agent actor-critic algorithms from recent years.

To further improve the accuracy of policy evaluation, this paper builds on GHMAC and proposes the Goal-conditioned Hierarchical Multi-agent Actor-Critic with Proximal Policy Optimization (GHMAC-PPO) algorithm, which is suitable for problems with discrete action spaces. First, this paper analyzes global state representations under the Centralized Training and Decentralized Execution (CTDE) framework and proposes an agent-specific global state representation. When evaluating the current policy during training, this representation eliminates redundant information and accounts for differences between agents, further improving the learning efficiency of the algorithm. Second, this paper adopts Proximal Policy Optimization (PPO) as the update algorithm for the lower policy, improving the agents' exploration efficiency and making the method applicable to discrete action spaces. Experimental results on the cooperative navigation task with a discrete action space show that GHMAC-PPO outperforms recent multi-agent stochastic policy gradient algorithms, and a series of ablation experiments further confirms the effectiveness of the two improvements.
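To make the two-level decomposition described above concrete, the sketch below shows one possible structure for a single goal-conditioned hierarchical agent in PyTorch: an upper policy maps the local observation to a goal (a point in a subset of the state space), and a lower policy maps (observation, goal) to an action. This is an illustrative sketch under stated assumptions, not the authors' implementation; the network sizes, goal dimensionality, re-planning interval k, and discrete action sampling are all assumptions added for illustration.

```python
# Minimal structural sketch of a goal-conditioned two-level agent.
# All dimensions and the re-planning interval k are illustrative assumptions,
# not values from the paper.
import torch
import torch.nn as nn


class UpperPolicy(nn.Module):
    """Selects a goal g (a point in a subset of the state space) from the observation."""
    def __init__(self, obs_dim: int, goal_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class LowerPolicy(nn.Module):
    """Acts in the environment conditioned on the observation and the current goal."""
    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, goal], dim=-1))


# Usage sketch: the upper policy re-plans a goal every k steps, while the
# lower policy acts toward that goal at every step (both assumptions).
obs_dim, goal_dim, act_dim, k = 8, 2, 5, 10
upper = UpperPolicy(obs_dim, goal_dim)
lower = LowerPolicy(obs_dim, goal_dim, act_dim)

obs = torch.zeros(1, obs_dim)                 # placeholder observation
goal = upper(obs)                             # chosen once per k-step segment
action_logits = lower(obs, goal)              # chosen at every environment step
action = torch.distributions.Categorical(logits=action_logits).sample()
```

In this sketch, inter-agent cooperation would only involve the upper policies (each agent's goal choice), which is what compresses the cooperative exploration space from the full state-action space to a subset of the state space.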