In recent years, research on deep reinforcement learning for agent systems has become a major topic in academia, and researchers have produced many strong results. Among these, how to make an agent system learn the optimal policy stably and efficiently during training is an important problem in the field of policy gradient optimization.

First, from the algorithmic perspective, existing work focuses mainly on deterministic policies. Deterministic policies suffer from evaluation bias and handle noise in the reward signal poorly, which reduces the stability of the algorithm. Second, from the application perspective, agents in a multi-agent system both cooperate and compete, which makes the environment complex and constantly changing. If a single-agent algorithm is applied directly to a multi-agent system, existing policy gradient optimization algorithms are limited in how much of the system's information they can process, which restricts exploration of the environment. To address these problems, this thesis analyzes the limitations of existing algorithms, improves them, and applies the improved algorithms to complex agent systems. The main work of the thesis is as follows.

First, this thesis proposes the Deep Stochastic Policy Gradient (DSPG) algorithm for the agent policy optimization problem. DSPG addresses the limitations of the Deep Deterministic Policy Gradient (DDPG) algorithm: it replaces the deterministic policy gradient in DDPG with a stochastic policy gradient, uses two parallel value networks in place of the single value network to improve the accuracy of action-value evaluation, and uses importance sampling to update the parameters of the target policy network, thereby improving the training of the policy network. Simulation experiments show that DSPG is more robust while exploring the environment more thoroughly.

Second, this thesis analyzes how multiple agents interact with the environment, identifies two challenges in extending the algorithm from a single-agent system to a multi-agent system, and proposes corresponding solutions. The first challenge is that each agent must account for the states and decisions of the other agents. Drawing on the centralized training and decentralized execution scheme of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, this thesis resolves the structure of the policy and value networks in the multi-agent system so that it can be trained more efficiently with shared policy information. The second challenge is the change in the decision-making objective: a multi-agent system no longer maximizes individual rewards but the global reward. This thesis proves that the stochastic policy gradient can be computed for a policy network evaluated by the multi-agent joint value function, and therefore that the improved Multi-Agent Deep Stochastic Policy Gradient (MADSPG) algorithm can maximize the global reward of the multi-agent system by the gradient descent method.

Finally, three multi-agent scenarios are built on the Gym platform: a hunting-escaping environment, an animal-world environment, and an action-world environment with an added communication mechanism.
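As an illustration of the interface such scenarios typically expose, the following is a minimal Gym-style sketch of a hunting-escaping environment. All class names, dynamics, and reward shapes here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

class HuntingEscaping:
    """Minimal sketch of a pursuit scenario: hunters are rewarded for
    closing in on the escaper, the escaper for keeping its distance."""

    def __init__(self, n_hunters=2):
        self.n_agents = n_hunters + 1      # last agent is the escaper
        self.pos = None

    def reset(self):
        self.pos = np.random.uniform(-1, 1, size=(self.n_agents, 2))
        return self._obs()

    def _obs(self):
        # Each agent observes every agent's position (fully observable sketch).
        return [self.pos.flatten() for _ in range(self.n_agents)]

    def step(self, actions):
        # actions: one 2-D velocity per agent, clipped to the unit box.
        self.pos += 0.1 * np.clip(np.asarray(actions), -1, 1)
        d = np.linalg.norm(self.pos[:-1] - self.pos[-1], axis=1)   # hunter-escaper distances
        rewards = [-d.min()] * (self.n_agents - 1) + [d.min()]     # shared pursuit reward
        done = bool(d.min() < 0.1)                                 # capture ends the episode
        return self._obs(), rewards, done, {}

env = HuntingEscaping()
obs = env.reset()
obs, rewards, done, info = env.step([np.zeros(2)] * env.n_agents)
```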
Experiments demonstrate that the improved MADSPG algorithm not only applies successfully to these multi-agent environments, but also outperforms the traditional MADDPG algorithm in stability, exploration ability, and computational efficiency.
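To make the policy update described in this abstract concrete, the following is a minimal sketch of a DSPG-style step combining its three ingredients: a stochastic (Gaussian) policy, two parallel value networks whose minimum curbs overestimation of action values, and an importance weight relating the behavior policy that collected the data to the target policy being optimized. All network shapes and hyperparameters are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # hypothetical dimensions

class GaussianPolicy(nn.Module):
    """Stochastic policy: a state-dependent mean with a learned log-std."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def make_critic():
    # One of the two parallel value networks mentioned in the abstract.
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                         nn.Linear(64, 1))

policy, behavior = GaussianPolicy(), GaussianPolicy()  # target vs. behavior policy
q1, q2 = make_critic(), make_critic()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def dspg_policy_update(obs, act):
    """One importance-weighted stochastic-policy-gradient step on a batch."""
    logp = policy.dist(obs).log_prob(act).sum(-1)
    with torch.no_grad():
        logp_b = behavior.dist(obs).log_prob(act).sum(-1)
        # Minimum of the two parallel critics reduces value overestimation.
        q = torch.min(q1(torch.cat([obs, act], -1)),
                      q2(torch.cat([obs, act], -1))).squeeze(-1)
    weight = (logp - logp_b).detach().exp()   # pi(a|s) / beta(a|s)
    loss = -(weight * logp * q).mean()        # ascend the weighted policy gradient
    opt.zero_grad()
    loss.backward()
    opt.step()

dspg_policy_update(torch.randn(32, obs_dim), torch.randn(32, act_dim))
```

Under MADSPG's centralized-training scheme, the two critics would instead take the joint observations and actions of all agents, while each agent's policy still conditions only on its own observation.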