| Reinforcement learning has been applied to challenging problems in games, robotics, and industrial autonomous driving. However, most of its successes have been in single-agent settings. In reality, many problems, such as UAV swarm control and multiplayer games, are multi-agent systems. As multi-agent environments become more common, developing multi-agent algorithms that improve on existing methods is of practical significance for the future use of multi-agent reinforcement learning. Existing multi-agent algorithms often target specific environments. Value-based methods learn faster and perform better in discrete environments, but they cannot adapt to continuous environments and suffer from instability during training. Algorithms based on the Actor-Critic structure suffer from slow learning and overly large network structures. Starting from the network structure and the training method, this paper makes targeted improvements to the problems that multi-agent algorithms encounter in practice. The main contributions are as follows. First, to address the low exploration efficiency, the credit assignment problem, and the inefficient action selection of value-based methods, a fusion approach is used to improve the multi-agent cooperative learning algorithm: a noisy network increases the exploration and robustness of the algorithm, a dueling network introduces a state-value stream and an action-advantage stream into the network, and value decomposition addresses the credit assignment problem. In experiments, the method is compared with IQL and VDN and improves on both. Second, to address the low training efficiency and poor stability of value-based methods, the training process is improved. Because rewards are sparse, reinforcement learning spends a large amount of time on trial-and-error exploration in the early stage of training. Starting from experience replay, and based on the observation that states close to the final goal are more likely to yield rewards, reverse random exploration from the final state is used to expand the experience pool, which addresses the slow early-stage training and low training efficiency of multi-agent algorithms. Finally, a two-layer advantage Actor-Critic algorithm with centralized training and decentralized execution is implemented. An advantage function is added to both the Actor and the Critic to capture the differences between agents' action states under the counterfactual baseline, which achieves better learning performance than the COMA algorithm in many environments. In addition, compared with value-based methods, the Actor-Critic structure can achieve cooperation in more complex environments. |
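The first contribution combines a dueling network with value decomposition. The abstract does not give the exact formulation, so the following is only a minimal sketch under two assumptions: the dueling streams are merged with the commonly used mean-subtracted aggregation, and the joint value is the plain VDN-style sum of per-agent values. The function names and the numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Merge a state-value stream and an action-advantage stream into Q-values.

    Assumed aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


def greedy_actions(per_agent_q: Sequence[Sequence[float]]) -> List[int]:
    """Each agent independently picks the action with its highest Q-value."""
    return [max(range(len(q)), key=q.__getitem__) for q in per_agent_q]


def vdn_joint_q(per_agent_q: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """VDN-style joint value: sum of each agent's Q-value for its chosen action."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))


# Hypothetical outputs of two agents' value and advantage streams.
agent_q = [
    dueling_q_values(1.0, [0.5, -0.5]),       # agent 0, two actions
    dueling_q_values(2.0, [0.0, 1.0, 2.0]),   # agent 1, three actions
]
joint_action = greedy_actions(agent_q)
joint_value = vdn_joint_q(agent_q, joint_action)
```

Because the joint value is a sum, maximizing each agent's own Q-value also maximizes the joint value, which is what allows decentralized action selection under this kind of decomposition.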