
Optimization Method For Reinforcement Learning Based On Overestimation Control And Exploration Enhancement

Posted on: 2024-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zhang
GTID: 2568307064485484
Subject: Computer Science and Technology

Abstract/Summary:
Deep reinforcement learning methods are widely used, but they still suffer from excessive training costs in real-world tasks. How to exploit a limited number of interaction steps for better performance, namely sample efficiency optimization, has become a hot topic of current research. With limited samples, value function methods mainly face overestimation bias, caused by the accumulation of value estimation errors during Bellman equation optimization, and insufficient agent exploration ability. Current optimizations, which introduce underestimation bias to mitigate overestimation and use intrinsic rewards to motivate exploration, are still not efficient enough. In this thesis, we focus on improving the exploration and exploitation capabilities of an agent by changing its action selection policy. We split the role of the action selection policy, optimizing the actor action policy and introducing an explorer action policy, so that exploitation ability during training and exploration ability during environment interaction are both taken into account. Under the condition of reducing additional bias and guaranteeing convergence, we construct sample-efficient algorithms for continuous action tasks. The main research work is as follows:

1) We optimize the actor's action selection policy based on variance control to reduce the generation and propagation of overestimation bias. We describe the causes of estimation bias and address the overestimation problem. Based on the double value function structure, we prove the existence of action policies that can mitigate the generation and propagation of overestimation bias. We propose an optimization of the action policy based on variance control, which mitigates the overestimation problem and makes the value function evaluation more accurate (a variance-controlled target is sketched after this abstract).

2) We propose the explorer-actor-critic (EAC) framework to improve the sample efficiency of value function methods. We split the role of the action policy and use an explorer network and an actor network to reduce interference between the exploration and optimization processes. Through three improvements, namely optimizing the actor training objective to reduce variance, introducing an explorer policy to enhance exploration ability, and using an action mixing mechanism to mitigate the experience distribution bias, we obtain a framework that balances exploration and exploitation, and we combine it with the baseline algorithms to obtain the EAC-TD3 (Explorer-Actor-Critic based Twin Delayed Deep Deterministic) and EAC-SAC (Explorer-Actor-Critic based Soft Actor-Critic) algorithms (see the explorer/actor sketch below).

3) For sparse reward tasks, we propose the ECM-TD3 (Twin Delayed Deep Deterministic with Explorer Curiosity Module) algorithm to further improve exploration ability. Using the correlation between state prediction error and the number of state-action visits, and considering the sensitivity of the value function approximation to intrinsic rewards and estimation bias, we introduce a state prediction network and an error estimation network to provide the explorer with an action gradient of potential exploration value, which improves exploration ability while keeping the training process stable (see the curiosity module sketch below).

In this thesis, we provide theoretical analysis and implementations, and conduct experiments on various tasks in the MuJoCo environment as well as their reward-modified variants. In the benchmark tasks, the EAC-TD3 and EAC-SAC methods achieve better sample efficiency and final performance. In sparse and noisy reward tasks, the EAC framework can effectively reduce estimation bias and improve exploration ability, with the ECM-TD3 algorithm being the most effective in terms of exploration. In the ablation experiments and sensitivity analysis, we validate the role of each component and provide guidance for parameter tuning. Thus, the EAC framework improves the sample efficiency of the baseline algorithms through both exploitation optimization and exploration enhancement, effectively improving performance, preventing the agent from prematurely getting stuck in suboptimal policies, and remaining compatible with other optimizations.
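The first contribution replaces the hard minimum of clipped double Q-learning with a variance-controlled target built from the double value function structure. The abstract does not spell out the exact formulation, so the following PyTorch sketch only illustrates the general idea under an assumed penalty coefficient `beta` (a hypothetical knob): the disagreement between the two critics is subtracted from their mean, which dampens overestimation with less underestimation bias than taking the minimum.

```python
import torch

def variance_controlled_td_target(reward, next_q1, next_q2, done,
                                  gamma=0.99, beta=0.5):
    """Sketch of a variance-controlled target for a double-critic setup.

    Instead of the hard minimum used by TD3-style clipped double Q-learning,
    the target penalises the critics' mean by their disagreement (std).
    `beta` is an assumed trade-off coefficient, not the thesis's setting.
    """
    q_stack = torch.stack([next_q1, next_q2], dim=0)
    q_mean = q_stack.mean(dim=0)
    q_std = q_stack.std(dim=0)                 # disagreement of the two critics
    target_q = q_mean - beta * q_std           # variance-penalised estimate
    return reward + gamma * (1.0 - done) * target_q
```

Taking the mean-minus-spread rather than the minimum is one common way to interpolate between overestimating and underestimating targets; the thesis's proof concerns which action policies keep this bias from propagating.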
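The EAC framework splits the behaviour policy into an actor trained for exploitation and an explorer trained with an exploration-oriented objective, and mixes their actions during environment interaction so the replay buffer is not dominated by either distribution. The sketch below assumes a simple Bernoulli mixing rule with probability `mix_prob`; the network sizes, Tanh-bounded actions, and the mixing rule itself are illustrative assumptions, not the thesis's exact mechanism.

```python
import torch

class ExplorerActorCritic(torch.nn.Module):
    """Minimal sketch of the explorer/actor split described in the abstract."""

    def __init__(self, obs_dim, act_dim, hidden=256, mix_prob=0.3):
        super().__init__()

        def mlp():
            return torch.nn.Sequential(
                torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU(),
                torch.nn.Linear(hidden, act_dim), torch.nn.Tanh())

        self.actor = mlp()      # exploitation policy, optimised against the critics
        self.explorer = mlp()   # exploration policy, optimised for novelty
        self.mix_prob = mix_prob

    @torch.no_grad()
    def act(self, obs):
        # Action mixing: occasionally let the explorer act so its experience
        # enters the replay buffer without biasing the actor's updates.
        use_explorer = torch.rand(()) < self.mix_prob
        policy = self.explorer if use_explorer else self.actor
        return policy(obs)
```

Keeping two separate policies lets the actor's training objective stay low-variance while the explorer absorbs the exploration pressure, which is the balance the EAC-TD3 and EAC-SAC variants build on.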
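The explorer curiosity module pairs a state prediction network with an error estimation network so that the explorer receives a differentiable novelty signal instead of a raw intrinsic reward injected into the critic target. The sketch below assumes squared-error losses and small MLPs; it illustrates the described structure rather than the thesis's implementation.

```python
import torch
import torch.nn as nn

class ExplorerCuriosityModule(nn.Module):
    """Sketch of the state-prediction / error-estimation pair from the abstract.

    `predictor` learns forward dynamics s' ~ f(s, a); its prediction error is
    large for rarely visited state-action pairs. `error_net` regresses that
    error, so it is differentiable w.r.t. the action and can supply the
    explorer with a gradient toward novel actions.
    """

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim))
        self.error_net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def losses(self, obs, act, next_obs):
        x = torch.cat([obs, act], dim=-1)
        pred_err = (self.predictor(x) - next_obs).pow(2).mean(dim=-1, keepdim=True)
        pred_loss = pred_err.mean()
        # Fit the error estimator to the (detached) prediction error.
        est_loss = (self.error_net(x) - pred_err.detach()).pow(2).mean()
        return pred_loss, est_loss

    def exploration_bonus(self, obs, act):
        # Differentiable novelty estimate used to shape the explorer's update,
        # keeping the critic target free of intrinsic-reward noise.
        return self.error_net(torch.cat([obs, act], dim=-1))
```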
Keywords: deep reinforcement learning, sample efficiency, overestimation bias, actor-critic framework, exploration and exploitation