Artificial intelligence is currently developing from perception to decision-making and from single-agent to multi-agent systems, which brings new focus on the behavioral decision-making and control of multiple agents. Because multiple agents learn through interaction at the same time, directly transferring deep reinforcement learning algorithms from single-agent to multi-agent task scenarios leads to environmental non-stationarity and dimensional explosion. In addition, the decision-making process of multiple agents is still limited by partial observability and slow training. To address these problems, this thesis designs and experimentally verifies behavioral decision-making algorithms for multiple agents based on the recently popular centralized training with decentralized execution (CTDE) paradigm and the Actor-Critic reinforcement learning method. The main contributions and innovations are as follows:

(1) To address environmental non-stationarity and insufficient exploration of the policy space, this thesis extends the maximum entropy reinforcement learning algorithm SAC under the CTDE paradigm and proposes MASAC, a multi-agent deep reinforcement learning algorithm based on CTDE and the maximum entropy framework. In the centralized training stage, each agent takes the policy changes of the other agents into account to overcome environmental non-stationarity, while maximizing the information entropy of the policy (sketched below) strengthens exploration and the robustness of the policy; the algorithm ultimately outputs a stochastic policy with stronger decision-making ability. Experimental results show that in cooperative, competitive, and mixed multi-agent tasks, both the algorithmic performance of MASAC and the actual behavior of the agents are better than those of the baseline algorithms.

(2) To address partial observability, a multi-agent deep reinforcement learning algorithm, R-MASAC, is proposed on the basis of MASAC. It enables agents to make behavioral decisions using the current local observation, the historical observation sequence, and communication messages transmitted between agents. A recurrent neural network (GRU) is introduced into each agent's Actor and Critic to provide memory, together with a hidden-state storage strategy. In addition, communication channels established between agents augment local observations, help agents make behavioral decisions, and promote collaboration among them. The algorithm is tested on a partially observable multi-agent synchronous fast-arrival task that demands close collaboration. Experimental results show that R-MASAC outperforms the baseline algorithm, and the trained agents can complete the task with a cooperative strategy under partial observability.

(3) To address slow training and the large amount of interaction experience data required, this thesis proposes DPER-MASAC, a distributed multi-agent deep reinforcement learning algorithm with prioritized experience replay, built on MASAC. A prioritized experience replay mechanism addresses the problem of sparse rewards, while importance sampling weights (sketched below) offset the bias caused by updates to the priority distribution. The algorithm is further extended to a distributed version to increase experience data throughput and data diversity. Experimental results show that in multi-agent tasks of different difficulties, both the algorithmic performance of DPER-MASAC and the actual behavior of the agents are better than those of the baseline algorithms.

There are 46 figures, 5 tables, and 98 references in this thesis.
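As a minimal sketch, the per-agent maximum-entropy objective that MASAC builds on is the standard SAC objective written per agent; the notation here (r_i for agent i's reward, \alpha for the entropy temperature, o_{i,t} for the local observation) is illustrative rather than the thesis's exact formulation:

\[
J(\pi_i) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \Big[ r_i(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi_i(\cdot \mid o_{i,t})\big) \Big]
\]

Under CTDE, each agent's critic can condition on the joint observations and actions of all agents during centralized training, which is what allows agent i to account for the other agents' changing policies while still executing on its own local observation.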
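For the prioritized experience replay used in DPER-MASAC, the standard formulation of Schaul et al. serves as a sketch: transition i is sampled with probability proportional to its priority p_i, and the resulting bias is corrected by an importance sampling weight. The priority exponent \alpha (distinct from the entropy temperature above), the annealing exponent \beta, and the buffer size N are the usual PER hyperparameters, not values stated in this abstract:

\[
P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}, \qquad
w_i = \frac{\big(N \cdot P(i)\big)^{-\beta}}{\max_{j} w_j}
\]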