
Bound Action Policy For Reinforcement Learning Exploration

Posted on: 2019-07-27  Degree: Master  Type: Thesis
Country: China  Candidate: J N Huang  Full Text: PDF
GTID: 2428330566983461  Subject: Computer Science and Technology
Abstract/Summary:
With the development of traditional continuous control techniques such as PID (Proportional Integral Derivative) control and SLAM (Simultaneous Localization and Mapping), continuous control tasks can now be performed with high accuracy, for example, pushing an object with a multi-joint arm or grasping a fixed object with a multi-joint hand. However, traditional continuous control methods have many parameters, require hand-crafted tuning, and perform poorly on difficult control tasks. Reinforcement learning, which uses the environment's feedback to learn a policy that solves the task, can overcome these limitations, but its low data efficiency means that training takes a long time. It is therefore important to propose methods that improve the data efficiency of reinforcement learning.

In reinforcement learning for continuous control tasks, it is common to represent the agent's policy with a Gaussian policy over an unbounded action space: a Gaussian distribution describes the probability of each action the agent may select given the current environment state. Using the feedback provided by the environment, policy gradient methods such as the REINFORCE algorithm and policy optimization methods such as TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) estimate policy gradient samples to update the policy. However, using a Gaussian policy with an unbounded action space to represent an agent whose real action space is bounded introduces a boundary effect, which biases the estimated policy gradient samples. Moreover, to encourage exploration of actions the agent has not taken before, the variance of the Gaussian distribution must be kept in a reasonable range; this disperses the sampled points, increases the variance of the policy gradient samples, and slows training convergence.

This thesis proposes a bounded-action policy named the logistic Gaussian policy. It is proven theoretically that, compared with the original Gaussian policy, the logistic Gaussian policy eliminates the boundary effect and reduces the variance of the estimated policy gradient samples. Experimental results show that, in both simple and complicated continuous control tasks, representing the agent's policy with the logistic Gaussian policy achieves better performance and faster training convergence for policy gradient methods such as TRPO and PPO.
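The abstract does not spell out the exact construction of the logistic Gaussian policy. The sketch below assumes the common squashed-Gaussian form, in which an unbounded Gaussian sample is passed through the logistic (sigmoid) function and rescaled into the action bounds, with the log-density corrected by the Jacobian of the squashing map; the function and parameter names are illustrative, not taken from the thesis.

```python
import numpy as np

def sample_logistic_gaussian(mean, log_std, low, high, rng=None):
    """Sample a bounded action from a 'logistic Gaussian' policy (sketch).

    An unbounded sample u ~ N(mean, std^2) is squashed into [low, high] by
    the logistic function, and the log-density is corrected by the Jacobian
    of the squashing map, so policy gradient samples are estimated for the
    bounded action distribution rather than the unbounded one.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean, log_std = np.asarray(mean, float), np.asarray(log_std, float)
    std = np.exp(log_std)

    u = mean + std * rng.standard_normal(mean.shape)   # unbounded Gaussian sample
    s = 1.0 / (1.0 + np.exp(-u))                       # logistic squashing into (0, 1)
    action = low + (high - low) * s                    # rescale to the action bounds

    # log N(u; mean, std) minus log |d action / d u| (change-of-variables formula).
    log_gauss = -0.5 * (((u - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
    log_jacobian = np.log(high - low) + np.log(s) + np.log(1.0 - s)
    log_prob = float(np.sum(log_gauss - log_jacobian))
    return action, log_prob

# Example: a 2-dimensional action bounded in [-1, 1].
action, log_prob = sample_logistic_gaussian(mean=np.zeros(2),
                                            log_std=-0.5 * np.ones(2),
                                            low=-1.0, high=1.0)
```

By contrast, an unbounded Gaussian policy whose samples are clipped at the action limits piles probability mass on the boundary, which is the boundary-effect bias discussed above; the squashing construction avoids clipping altogether.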
Keywords/Search Tags:Reinforcement Learning, Gaussian Policy, Bound Action, Boundary Effect, Policy Gradient