
Bound Action Policy For Reinforcement Learning Exploration

Posted on: 2019-07-27  Degree: Master  Type: Thesis
Country: China  Candidate: J N Huang  Full Text: PDF
GTID: 2428330566983461  Subject: Computer Science and Technology
Abstract/Summary:
With the development of traditional continuous control techniques such as PID (Proportional Integral Derivative) control and SLAM (Simultaneous Localization and Mapping), continuous control tasks can now be performed with high accuracy, for example, pushing an object with a multi-joint arm or grasping a fixed object with a multi-joint hand. However, traditional continuous control methods have many parameters, require hand-crafted tuning, and perform poorly on difficult control tasks. Reinforcement learning, which uses the environment's feedback to learn a policy that solves the task, can overcome these limitations, but its low data efficiency means that training takes a long time. It is therefore important to propose methods that improve the data efficiency of reinforcement learning.

In reinforcement learning for continuous control tasks, it is common to represent the agent's policy with a Gaussian policy over an unbounded action space: a Gaussian distribution describes the probability of each action the agent may select given the current environment state. Using the feedback provided by the environment, policy gradient methods such as the REINFORCE algorithm and policy optimization methods such as TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) estimate policy gradient samples to update the policy. However, using a Gaussian policy with an unbounded action space to represent an agent whose real action space is bounded introduces a boundary effect, which biases the estimated policy gradient samples. Moreover, to encourage exploration of actions the agent has not taken before, the variance of the Gaussian distribution must be kept in a reasonable range; this disperses the sampled points, increases the variance of the policy gradient samples, and slows training convergence.

This thesis proposes a bounded-action policy named the logistic Gaussian policy. It is proven theoretically that, compared with the original Gaussian policy, the logistic Gaussian policy eliminates the boundary effect and reduces the variance of the estimated policy gradient samples. Experimental results show that, in both simple and complicated continuous control tasks, representing the agent's policy with the logistic Gaussian policy achieves better performance and faster training convergence for policy gradient methods such as TRPO and PPO.
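The abstract does not spell out the exact construction of the logistic Gaussian policy. The sketch below assumes the common squashed-Gaussian form, in which an unbounded Gaussian sample is passed through the logistic (sigmoid) function and rescaled into the action bounds, with the log-density corrected by the Jacobian of the squashing map; the function and parameter names are illustrative, not taken from the thesis.

```python
import numpy as np

def sample_logistic_gaussian(mean, log_std, low, high, rng=None):
    """Sample a bounded action from a 'logistic Gaussian' policy (sketch).

    An unbounded sample u ~ N(mean, std^2) is squashed into [low, high] by
    the logistic function, and the log-density is corrected by the Jacobian
    of the squashing map, so policy gradient samples are estimated for the
    bounded action distribution rather than the unbounded one.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean, log_std = np.asarray(mean, float), np.asarray(log_std, float)
    std = np.exp(log_std)

    u = mean + std * rng.standard_normal(mean.shape)   # unbounded Gaussian sample
    s = 1.0 / (1.0 + np.exp(-u))                       # logistic squashing into (0, 1)
    action = low + (high - low) * s                    # rescale to the action bounds

    # log N(u; mean, std) minus log |d action / d u| (change-of-variables formula).
    log_gauss = -0.5 * (((u - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
    log_jacobian = np.log(high - low) + np.log(s) + np.log(1.0 - s)
    log_prob = float(np.sum(log_gauss - log_jacobian))
    return action, log_prob

# Example: a 2-dimensional action bounded in [-1, 1].
action, log_prob = sample_logistic_gaussian(mean=np.zeros(2),
                                            log_std=-0.5 * np.ones(2),
                                            low=-1.0, high=1.0)
```

By contrast, an unbounded Gaussian policy whose samples are clipped at the action limits piles probability mass on the boundary, which is the boundary-effect bias discussed above; the squashing construction avoids clipping altogether.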
Keywords/Search Tags:Reinforcement Learning, Gaussian Policy, Bound Action, Boundary Effect, Policy Gradient