By interacting with the environment, agents use reinforcement learning to optimize their policies so as to maximize rewards or accomplish specific tasks. Deep reinforcement learning, which combines reinforcement learning with deep learning, not only has powerful feature extraction and representation abilities for perceiving agent attributes and environmental information, but also has the strong exploration ability needed to adapt to the dynamics of complex environments, and it performs well on many complex problems. In multi-agent cooperative decision-making tasks in particular, multi-agent deep reinforcement learning has become a research hotspot and has been widely applied in fields such as UAV formation coordination, transportation hub control, and intelligent logistics. Multi-agent deep reinforcement learning therefore has important value in both theoretical research and practical application.

In practical systems, a single agent usually has only local observations; that is, the multi-agent system operates in a partially observable environment. In tasks that demand a high degree of collaboration, close cooperation among agents maximizes the team's return. Under partial observability, however, each agent has only limited knowledge of a complex environment and must communicate in order to coordinate. How to strengthen agents' perception of the environment through effective inter-agent communication, and thereby improve the quality of their decisions, is thus an important question in multi-agent systems research. On this basis, this thesis studies how agents learn communication strategies during multi-agent cooperative decision-making in partially observable environments, and it proposes two multi-agent reinforcement learning methods: one for effectively identifying and processing messages during communication, and one for optimizing communication resources. The specific research contents are as follows.

(1) To address message redundancy and noise during communication, this thesis proposes AMSAC, a multi-agent reinforcement learning method based on attentional message sharing. First, a message sharing space is built on top of the multi-agent actor-critic architecture; agents read and write messages in this shared space to construct global environment information, which remedies the lack of inter-agent communication in partially observable, complex tasks. Second, an attention mechanism over the message sharing space identifies important messages and processes them accordingly, improving the message processing capability of the multi-agent system. Finally, the centralized critic network makes full use of global state and action information and applies a temporal-difference advantage policy gradient to properly evaluate the value of each agent's actions. Experiments carried out in a multi-agent cooperative confrontation environment show that AMSAC outperforms the baselines in four different scenarios.
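The abstract does not give implementation details for the shared space or the attention read-out; the following is a minimal sketch of how an attention-weighted read over a shared message space could look, written in PyTorch. All module names, dimensions, and the scaled dot-product formulation are illustrative assumptions, not AMSAC's actual design.

```python
# Illustrative sketch of attention over a shared message space (PyTorch).
# Names, dimensions, and the scaled dot-product form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMessageSharing(nn.Module):
    """Each agent writes a message to a shared space, then reads back an
    attention-weighted summary of all agents' messages."""

    def __init__(self, obs_dim: int, msg_dim: int):
        super().__init__()
        self.write = nn.Linear(obs_dim, msg_dim)   # encode observation -> message
        self.query = nn.Linear(obs_dim, msg_dim)   # per-agent attention query
        self.key = nn.Linear(msg_dim, msg_dim)
        self.value = nn.Linear(msg_dim, msg_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (n_agents, obs_dim)
        messages = self.write(obs)               # write phase: (n_agents, msg_dim)
        q = self.query(obs)                      # (n_agents, msg_dim)
        k = self.key(messages)                   # (n_agents, msg_dim)
        v = self.value(messages)                 # (n_agents, msg_dim)
        scores = q @ k.t() / k.shape[-1] ** 0.5  # (n_agents, n_agents)
        weights = F.softmax(scores, dim=-1)      # weight messages by relevance
        return weights @ v                       # read phase: per-agent summary

# Usage: 4 agents with 32-dim observations read 16-dim aggregated messages.
obs = torch.randn(4, 32)
shared = AttentionMessageSharing(32, 16)
read_out = shared(obs)                           # shape: (4, 16)
```

In a design of this kind, each agent's read-out would typically be concatenated with its own observation before the actor selects an action, so that local decisions are conditioned on an attention-filtered view of the team's messages.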
(2) Multi-agent value function decomposition methods perform well in addressing non-stationarity and scalability, but they lack coordination during decentralized execution. To address this, this thesis proposes BESQ, a multi-agent reinforcement learning method based on information-theoretic optimization. On top of the multi-agent value function decomposition architecture, BESQ designs two regularization optimizers for communication messages using information-theoretic techniques and builds from them a communication resource optimization mechanism among agents, which solves the coordination problem of value function decomposition methods in decentralized execution. Specifically, first, to enrich the expressiveness of agents' communication messages, a regularization optimizer is established that maximizes the mutual information between an agent's message and its action selection, reducing the uncertainty of other agents' action-value functions. At the same time, to keep agents' communication messages succinct, a regularization optimizer is established that minimizes the entropy of agents' messages, so that the messages exchanged carry the information most important for decision-making. Finally, BESQ realizes this communication resource optimization mechanism on top of the multi-agent value function decomposition method Qatten, organically combining value function decomposition with communication learning. Experiments carried out in a multi-agent cooperative confrontation environment show that BESQ outperforms the baselines in four different scenarios.
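The abstract does not state how the two regularizers are made tractable; the sketch below illustrates one common construction in PyTorch: a variational lower bound on the mutual information via a learned action predictor, and a softmax entropy penalty on the message. The module names, the softmax message parameterization, and the loss coefficients are assumptions for illustration, not BESQ's exact formulation.

```python
# Illustrative sketch of the two information-theoretic regularizers (PyTorch).
# The variational MI bound, softmax message distribution, and all names and
# coefficients are assumptions, not taken from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BesqStyleRegularizers(nn.Module):
    def __init__(self, msg_dim: int, n_actions: int):
        super().__init__()
        # Variational predictor q(a | m): maximizing its log-likelihood on the
        # action actually taken lower-bounds I(message; action).
        self.action_predictor = nn.Linear(msg_dim, n_actions)

    def forward(self, msg_logits: torch.Tensor, actions: torch.Tensor):
        # msg_logits: (batch, msg_dim) raw message scores; actions: (batch,)
        # (1) Expressiveness: maximize mutual information between message and
        # action selection, approximated by minimizing the predictor's
        # cross-entropy on the chosen actions.
        mi_loss = F.cross_entropy(self.action_predictor(msg_logits), actions)
        # (2) Succinctness: minimize message entropy so messages concentrate
        # on the most decision-relevant content.
        msg_dist = F.softmax(msg_logits, dim=-1)
        entropy_loss = -(msg_dist * torch.log(msg_dist + 1e-8)).sum(-1).mean()
        return mi_loss, entropy_loss

# Usage: both terms would be added, with trade-off coefficients, to the main
# value-decomposition (e.g., Qatten-style) TD loss.
reg = BesqStyleRegularizers(msg_dim=16, n_actions=5)
msgs = torch.randn(8, 16)                 # 8 sampled messages
acts = torch.randint(0, 5, (8,))          # actions actually taken
mi_loss, ent_loss = reg(msgs, acts)
total_reg = mi_loss + 0.1 * ent_loss      # coefficients are placeholders
```

The two terms pull in opposite directions by design: the mutual-information bound keeps messages informative about behavior, while the entropy penalty prunes content that does not help predict actions.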