
Research On Regularized Policy Gradient

Posted on: 2020-11-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L T Li    Full Text: PDF
GTID: 1368330605972474    Subject: Control Science and Engineering
Abstract/Summary:
Control learning is a central task in reinforcement learning (RL): the goal is to learn an optimal policy that maximizes the expected return. Recently, many advanced policy gradient algorithms have been proposed to address reinforcement learning problems with continuous action spaces. These advances include reducing the computational complexity to O(n), reducing the variance of policy gradient estimates, improving algorithmic stability, and extending the methods to off-policy learning and the POMDP setting. Like other machine learning (ML) methods, RL can overfit the data when the function spaces used for value function approximation and for the policy are large; however, regularization in policy gradient estimation remains largely unexplored. Using parameter norm penalties, this work focuses on regularized policy gradient algorithms that restrict the policy parameters in order to control model capacity. The main contributions of this research are as follows:

1. We propose a new actor-critic (AC) framework named critic-iteration policy gradient (CIPG), which learns the state value function of the current policy in an on-policy way and performs gradient ascent in the direction that improves the discounted total reward. During each iteration, CIPG keeps the policy parameters fixed and evaluates the resulting fixed policy with an l2-regularized RLS-TD critic. Our convergence analysis extends previous convergence analyses of policy gradient with function approximation to the case of an RLS-TD critic. Simulation results demonstrate that the l2-regularization term in the critic of CIPG remains undamped throughout learning, and that CIPG achieves better learning efficiency and a faster convergence rate than conventional actor-critic learning control methods.
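To make the critic step concrete, the following is a minimal NumPy sketch of an l2-regularized recursive least-squares TD(0) update of the kind CIPG's critic could use; the class name, interface, and the choice of initializing P as the inverse of the ridge term are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

class L2RLSTDCritic:
    """Hypothetical l2-regularized RLS-TD(0) critic for V(s) = phi(s)^T theta."""

    def __init__(self, n_features, gamma=0.99, l2=1.0):
        self.gamma = gamma
        # Initializing P = (l2 * I)^(-1) injects the ridge penalty of the
        # regularized least-squares TD solution theta = (A + l2*I)^(-1) b.
        self.P = np.eye(n_features) / l2
        self.theta = np.zeros(n_features)

    def update(self, phi, reward, phi_next):
        # Feature difference used by least-squares TD.
        delta_phi = phi - self.gamma * phi_next
        # Sherman-Morrison rank-one update of P = (A + l2*I)^(-1).
        P_phi = self.P @ phi
        denom = 1.0 + delta_phi @ P_phi
        self.P -= np.outer(P_phi, delta_phi @ self.P) / denom
        # TD error under the current parameters, corrected with the RLS gain.
        td_error = reward + self.gamma * (phi_next @ self.theta) - phi @ self.theta
        self.theta += (self.P @ phi) * td_error

    def value(self, phi):
        return phi @ self.theta
```

In this sketch the gain P @ phi after the rank-one update equals the usual RLS-TD gain, so the critic performs recursive least-squares policy evaluation with an l2 penalty while the policy parameters are held fixed.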
2. l1-regularization is applied to the actor network to perform feature selection. In each iteration, the policy parameters are updated by the regularized dual averaging (RDA) technique, which solves a minimization problem involving two terms: the running average of the past policy gradients and the l1-regularization term on the policy parameters. The solution of this minimization problem can be computed efficiently, and we call the resulting adaptation of policy gradient RDA-policy gradient (RDA-PG). The proposed RDA-PG can learn stochastic and deterministic near-optimal policies. Convergence of the algorithm is established using the theory of two-time-scale stochastic approximation. Simulation and experimental results show that RDA-PG successfully performs feature selection in the actor and learns sparse actor representations in both the stochastic and deterministic cases.

3. A regularized deep reinforcement learning off-policy actor-critic algorithm is proposed, based on l1/l2 parameter regularization of the actor. The objective can be optimized with the automatic differentiation (autograd) facilities of PyTorch or TensorFlow to update the weights of both the actor and the critic, and the resulting algorithm can solve reinforcement learning problems with continuous action spaces. Building on soft actor-critic (soft-AC), we introduce l1/l2 parameter regularization in the actor: we propose an l1/l2-regularized objective function, define the corresponding value function and Bellman equation, and prove that the regularized policy iteration converges to the optimal policy in the tabular setting. We then extend this method to the general continuous setting using function approximation, where the objective function of the actor is defined by the Bellman equation. The resulting policy gradients are derived with the likelihood-ratio gradient estimator and the reparameterization trick. We also automate the adjustment of the regularization parameter based on the dynamic programming principle.
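As a rough illustration of the third contribution, the following PyTorch sketch shows a soft-AC-style actor loss with an added l2 parameter penalty, using the reparameterization trick; the network interfaces, the entropy weight alpha, and the coefficient reg_coef are placeholder assumptions rather than the thesis implementation, and reg_coef is fixed here instead of being adjusted automatically.

```python
import torch

def regularized_actor_loss(actor, critic, states, alpha=0.2, reg_coef=1e-3):
    """Hypothetical soft-AC-style actor loss with an l2 parameter penalty."""
    # Reparameterization trick: sample actions as a differentiable function of
    # the policy parameters so gradients flow through the sampled actions.
    dist = actor(states)          # assumed to return a torch.distributions object
    actions = dist.rsample()
    log_prob = dist.log_prob(actions).sum(-1)

    # Soft-AC-style objective: trade off expected Q-value against entropy.
    q_values = critic(states, actions)
    loss = (alpha * log_prob - q_values).mean()

    # Parameter regularization on the actor weights; an l1 penalty would use
    # p.abs().sum() instead of p.pow(2).sum().
    penalty = sum(p.pow(2).sum() for p in actor.parameters())
    return loss + reg_coef * penalty
```

A training step would evaluate this loss on a minibatch of states, backpropagate it with autograd, and step an optimizer over the actor parameters, which matches the abstract's point that the regularized objective can be optimized directly with PyTorch or TensorFlow.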
Keywords/Search Tags: reinforcement learning, parameter regularization, policy gradient, policy evaluation, function approximation