
Research On Multiagent Policy Optimization Based On Deep Reinforcement Learning

Posted on: 2020-11-27; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Y Zheng; Full Text: PDF
GTID: 1488306131967659; Subject: Software engineering
Abstract/Summary:
Recently, deep reinforcement learning (DRL) has received extensive attention and produced many research results. In the domain of DRL and multiagent systems (MAS), achieving efficient policy optimization is a key problem that still faces several limitations and challenges. First, from the perspective of the environment, existing DRL algorithms are limited in handling multimodal inputs. Second, from the perspective of DRL algorithms, the estimation of Q-values is biased, and existing DRL algorithms are unable to handle noise in the received rewards. Lastly, from the perspective of MAS, existing algorithms are unable to achieve efficient cooperation between independent learners or policy optimization against non-stationary opponents. To address these issues, this paper focuses on multiagent policy optimization based on DRL algorithms and tries to develop effective policy optimization algorithms by overcoming these limitations from the perspectives of the environment, DRL algorithms, and MAS. The main contents of this paper are as follows.

First, this paper studies the problem of policy optimization with multimodal inputs. The separated multimodal network (SMMN) is proposed to overcome the difficulty of handling multimodal inputs. SMMN can be easily combined with vanilla DRL algorithms to handle multimodal inputs. In addition, a hierarchical attention (HA) mechanism is proposed to allocate weights between and within multimodal inputs, resulting in better feature extraction. Finally, a modified LSTM network is proposed to handle multiple inputs effectively. This study enhances the ability of existing algorithms to handle multimodal inputs and to achieve policy optimization effectively.

Second, this paper studies the problems of estimation correction and policy optimization of independent learners in noisy environments. To reduce the estimation bias in DRL algorithms, WDDQN is proposed based on weighted double estimators. The reward network (RN) is proposed to handle the noise in rewards. Meanwhile, to encourage agents to cooperate, the lenient reward network (LRN) is proposed based on the notion of leniency. Finally, the scheduled replay strategy is proposed to achieve efficient policy optimization. In summary, this study achieves effective estimation correction and cooperative policy optimization between independent learners, and improves the probability of converging to a Pareto-optimal Nash equilibrium.

Lastly, this paper studies policy optimization against non-stationary agents. Existing multiagent reinforcement learning algorithms do not explicitly classify non-stationary opponents, but try to deal with them using one general policy. To overcome the challenge of non-stationary agents in MAS and the limitation of existing algorithms that use a single response policy against a non-stationary opponent, a rectified belief model based on the opponent model is proposed; it achieves accurate opponent detection from the perspectives of reward signals and opponent behaviors. In addition, the distillation policy network (DPN) is proposed as a policy library to achieve fast policy switching, convenient policy reuse, and efficient policy storage. In summary, this study achieves accurate opponent classification and efficient strategy reuse, which sheds light on research on playing against non-stationary opponents.

In summary, this paper takes multiagent policy optimization based on DRL algorithms as its research goal and studies it from the perspectives of the environment, DRL algorithms, and MAS. Specifically, this paper studies policy optimization with multimodal inputs, policy optimization of independent learners, and policy optimization against non-stationary agents. Empirical experiments confirm the effectiveness of the proposed methods. This paper focuses on engineering practice and offers guidance for applying DRL algorithms to practical problems. Meanwhile, it sheds light on further research on multimodal reinforcement learning, policy optimization of independent learners, finding Pareto-optimal Nash equilibria, and dealing with non-stationary opponents.
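
The abstract does not give the WDDQN update itself. As a rough illustration of what a "weighted double estimator" can look like, the sketch below follows one common formulation from the weighted double Q-learning literature: the bootstrap value is a blend of the single-estimator value (which tends to overestimate) and the double-estimator value (which tends to underestimate). The function name `weighted_double_q_target`, the constant `c`, and the array-based interface are assumptions made for this example and are not taken from the dissertation.

```python
import numpy as np


def weighted_double_q_target(q_a_next, q_b_next, reward, gamma=0.99, c=1.0, done=False):
    """Hypothetical sketch of a weighted double estimator target for estimator A.

    q_a_next, q_b_next: action values of two estimators at the next state.
    c: assumed positive constant controlling how quickly the weight moves
       from the double estimator (beta = 0) toward the single one (beta -> 1).
    """
    q_a_next = np.asarray(q_a_next, dtype=float)
    q_b_next = np.asarray(q_b_next, dtype=float)
    if done:
        # No bootstrapping from a terminal state.
        return float(reward)

    a_star = int(np.argmax(q_a_next))  # greedy action under estimator A
    a_low = int(np.argmin(q_a_next))   # lowest-valued action under estimator A

    # Weight grows with the gap that estimator B sees between those two actions.
    gap = abs(q_b_next[a_star] - q_b_next[a_low])
    beta = gap / (c + gap)

    # Blend the single-estimator value (A) with the double-estimator value (B).
    blended = beta * q_a_next[a_star] + (1.0 - beta) * q_b_next[a_star]
    return float(reward + gamma * blended)
```

With `beta = 0` the target reduces to the usual double Q-learning target, and as `beta` approaches 1 it approaches the standard single-estimator target; in a deep variant the two arrays would come from two networks (for example, an online and a target network), which is again an assumption here.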
Keywords/Search Tags: Deep reinforcement learning, Multiagent system, Non-stationary agent, Bayesian policy reuse, Policy optimization, Multimodal learning