The emergence of deep reinforcement learning effectively addresses the curse of dimensionality encountered in reinforcement learning. When an agent operates in a high-dimensional environment, deep reinforcement learning uses deep neural networks to extract features from the environment and uses reinforcement learning methods to learn the agent's policy. Following the successful application of deep reinforcement learning in single-agent environments, more and more researchers have begun to apply it to multi-agent collaborative environments. Unlike single-agent environments, each agent's policy changes constantly during training in a multi-agent collaborative environment, so every agent faces a non-stationary environment, which makes it difficult for its policy to converge. Multi-agent collaboration also needs to solve the communication problem between agents: an effective communication mechanism can accelerate the learning of each agent's policy. At the same time, as the number of agents in the environment increases, the joint state space becomes huge, and the convergence of multi-agent collaborative algorithms faces many challenges. In view of these problems in multi-agent collaboration, this thesis conducts the following research:

(1) To help agents stabilize the learning environment in a multi-agent setting, this thesis adopts the framework of centralized training with decentralized execution (CTDE) to extend the maximum-entropy deep reinforcement learning algorithm Soft Actor-Critic (SAC), and proposes MASAC, a multi-agent deep reinforcement learning algorithm based on the maximum-entropy framework. During training, each agent can use additional information from the environment, including the observations and actions of other agents, which stabilizes the learning environment and improves the stability of the algorithm. During execution, each agent only needs its own observation as the input of its policy network to decide its action. To solve the communication problem between agents in MASAC, this thesis introduces a communication device that can be shared among agents. Agents use a gating mechanism based on the GRU principle to implement read and write operations on the communication device; during training, each agent must continuously learn a suitable way to communicate in order to achieve better performance (a minimal sketch of such a gated channel is given below). At the same time, MASAC constructs a separate critic network for each agent, so that each agent has an independent reward function. The experimental results show that MASAC performs well in collaborative, competitive, and mixed collaborative-competitive environments, and that in partially observable environments effective communication between agents improves their performance.
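The following sketch shows one way such a shared, gated communication device could be written in PyTorch. The class, the dimensions, and the use of nn.GRUCell for the write operation are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class SharedCommChannel(nn.Module):
    """Gated read/write access to a memory vector shared by all agents
    (an illustrative sketch of a GRU-style communication device)."""

    def __init__(self, msg_dim, mem_dim):
        super().__init__()
        self.write = nn.GRUCell(msg_dim, mem_dim)               # gated write, GRU update rule
        self.read_gate = nn.Linear(msg_dim + mem_dim, mem_dim)  # per-agent read gate

    def write_to(self, memory, message):
        # The GRU reset/update gates decide how much of the old memory is kept.
        return self.write(message, memory)

    def read_from(self, memory, message):
        # A sigmoid gate selects which part of the shared memory this agent reads.
        gate = torch.sigmoid(self.read_gate(torch.cat([message, memory], dim=-1)))
        return gate * memory

# Hypothetical usage: each agent writes its message, then reads a gated view.
channel = SharedCommChannel(msg_dim=16, mem_dim=32)
memory = torch.zeros(1, 32)
for msg in [torch.randn(1, 16) for _ in range(3)]:   # three agents' messages
    memory = channel.write_to(memory, msg)
reading = channel.read_from(memory, torch.randn(1, 16))
```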
(2) To address the problem that the performance of MASAC decreases as the number of agents in the environment increases, this thesis uses the self-attention mechanism to improve the scalability of MASAC and proposes the algorithm ATT-MASAC. The self-attention mechanism helps agents distinguish the importance of different pieces of state information in the environment through attention weights: it assigns more weight to the key information that can improve the agents' performance while down-weighting unimportant information, so that each agent's critic network can process environmental information more effectively (a minimal sketch of such an attention-weighted critic is given at the end of this abstract). At the same time, each agent in ATT-MASAC has its own self-attention model; compared with algorithms that share attention parameters across agents, ATT-MASAC performs better in environments with complex reward structures. The experimental results show that ATT-MASAC has better scalability in more complex multi-agent environments.

This thesis has 31 figures, 5 tables and 81 references.
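As a closing illustration of the attention-weighted critic described in (2), the sketch below shows a centralized critic that attends over all agents' observation-action embeddings. The module layout, names, and dimensions are assumptions made for illustration, not the thesis architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    """Centralized critic for one agent that weights the agents'
    observation-action embeddings with scaled dot-product attention
    (an illustrative sketch, not the thesis implementation)."""

    def __init__(self, obs_dim, act_dim, n_agents, hidden=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, hidden)   # per-agent embedding
        self.query = nn.Linear(hidden, hidden, bias=False)
        self.key   = nn.Linear(hidden, hidden, bias=False)
        self.value = nn.Linear(hidden, hidden, bias=False)
        self.q_head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                    nn.ReLU(),
                                    nn.Linear(hidden, 1))   # scalar Q-value

    def forward(self, obs, act, agent_idx):
        # obs: (batch, n_agents, obs_dim), act: (batch, n_agents, act_dim)
        e = F.relu(self.embed(torch.cat([obs, act], dim=-1)))   # (B, N, H)
        q = self.query(e[:, agent_idx:agent_idx + 1])           # this agent's query
        k, v = self.key(e), self.value(e)
        scores = (q @ k.transpose(1, 2)) / k.shape[-1] ** 0.5   # scaled dot products
        attn = F.softmax(scores, dim=-1)                        # attention weights (B, 1, N)
        context = (attn @ v).squeeze(1)                         # weighted summary of all agents
        return self.q_head(torch.cat([e[:, agent_idx], context], dim=-1))
```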