
Research On Deep Reinforcement Learning Method For Environment With Non-stationary Dynamics

Posted on: 2022-01-18    Degree: Master    Type: Thesis
Country: China    Candidate: Y Pu    Full Text: PDF
GTID: 2518306323979719    Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, deep reinforcement learning methods have been applied successfully to video games, Go, poker, robotic control and other fields. However, many problems and challenges remain, such as low sample efficiency, the exploration-exploitation dilemma, extreme sensitivity to hyperparameters, and poor convergence and reproducibility. In particular, when the dynamics of the environment (including the state transition probability function and the reward function) change, deep reinforcement learning algorithms become especially unstable. How to obtain an efficient, stable and general reinforcement learning method in such environments is therefore an important research direction. To address these issues, this dissertation conducts research in single-agent and multi-agent scenarios respectively; the main work and innovations are as follows.

In the single-agent setting, we propose a latent context based soft actor-critic method (LC-SAC). This method introduces an additional latent context encoder module. The encoder uses a recurrent neural network, takes experience transition triples (state, action and reward) as inputs, and outputs a context variable. By optimizing a contrastive prediction loss function, the context vector captures information about the environment dynamics and the recent behavior of the agent, which is critical for effective policy optimization in environments with non-stationary dynamics. Combined with the soft policy iteration paradigm, LC-SAC then alternates between soft policy evaluation and soft policy improvement. Experimental results show that LC-SAC significantly outperforms the SAC algorithm on the MetaWorld ML1 tasks, whose dynamics change across episodes, and is comparable to SAC on the continuous control benchmark MuJoCo, whose dynamics change slowly or not at all between episodes. We also conduct experiments to determine the impact of different hyperparameter settings on the performance of LC-SAC and give recommendations for choosing them.

In the multi-agent setting, we propose a multi-agent soft actor-critic method (mSAC) based on action-value function decomposition, which effectively combines multi-agent value function decomposition with policy-based methods. Its main modules include a decomposed Q network architecture, discrete probabilistic policies, and an optional counterfactual advantage function. Theoretically, mSAC supports efficient off-policy learning and can be applied to tasks with either discrete or continuous action spaces. On the StarCraft II micromanagement benchmark, a real-time-strategy game, we empirically investigate the performance of mSAC against its variants and analyze the effects of its different components. Experimental results demonstrate that mSAC significantly outperforms the policy-based approach COMA and achieves results competitive with the state-of-the-art value-based approach QMIX on most tasks in terms of asymptotic performance. In addition, mSAC substantially outperforms QMIX on many tasks with large action spaces.

In summary, this dissertation addresses the problem of non-stationary dynamics in complex environments and proposes corresponding improved algorithms in single-agent and multi-agent scenarios, respectively. Good experimental results have been achieved, which have practical application value and help drive the development of the reinforcement learning field.
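To make the single-agent idea concrete, the following is a minimal sketch (not the thesis code; all class and parameter names are illustrative) of a latent context encoder of the kind described above: a GRU consumes (state, action, reward) transition triples and emits a context vector, and an InfoNCE-style contrastive prediction loss encourages that vector to identify future transitions from the same episode among negatives drawn from other episodes in the batch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentContextEncoder(nn.Module):
    """Hypothetical context encoder: transition triples -> context vector."""
    def __init__(self, state_dim, action_dim, context_dim=32, hidden_dim=128):
        super().__init__()
        trans_dim = state_dim + action_dim + 1          # (s, a, r) flattened
        self.gru = nn.GRU(trans_dim, hidden_dim, batch_first=True)
        self.to_context = nn.Linear(hidden_dim, context_dim)
        # Bilinear score matrix used by the contrastive prediction objective.
        self.W = nn.Parameter(torch.randn(context_dim, trans_dim) * 0.01)

    def forward(self, transitions):
        # transitions: (batch, seq_len, trans_dim) -> context: (batch, context_dim)
        _, h = self.gru(transitions)
        return self.to_context(h.squeeze(0))

    def contrastive_loss(self, context, future_transitions):
        # Positive pair: a context and a future transition from the same episode;
        # negatives: the future transitions of the other episodes in the batch.
        scores = context @ self.W @ future_transitions.t()   # (batch, batch)
        labels = torch.arange(scores.size(0), device=scores.device)
        return F.cross_entropy(scores, labels)

In such a design the learned context vector would be concatenated with the state and fed to both the SAC actor and critics, so the policy can condition on the inferred dynamics.

For the multi-agent method, the sketch below illustrates (again as an assumption about one plausible realization, not the author's implementation) the value-decomposition side of mSAC: per-agent utility networks produce local action values that a monotonic mixer combines into a joint value, which then serves as the critic for soft actor-critic updates with discrete stochastic policies.

import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """Per-agent utility Q_i over the agent's local observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return self.net(obs)                            # (batch, n_actions)

class AdditiveMixer(nn.Module):
    """Simplest monotonic mixer: Q_tot = sum_i Q_i (a VDN-style stand-in;
    a QMIX-style mixer would use a state-conditioned hypernetwork instead)."""
    def forward(self, agent_qs):
        # agent_qs: (batch, n_agents) of chosen-action utilities
        return agent_qs.sum(dim=-1, keepdim=True)       # (batch, 1)

Each agent would additionally carry a categorical (discrete probabilistic) policy, and the SAC temperature term adds the usual policy-entropy bonus to the mixed critic target, which is what allows efficient off-policy learning while retaining the centralized-training, decentralized-execution structure of value decomposition methods.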
Keywords/Search Tags:Deep Reinforcement Learning, Non-stationary Environment, Multi-Agent, Game Operation, Robotic Control