
Robust Policy Gradient Algorithm Based On Actor-Critic In Deep Reinforcement Learning

Posted on: 2022-01-14  Degree: Master  Type: Thesis
Country: China  Candidate: W Q Zhao  Full Text: PDF
GTID: 2558307070952389  Subject: Computer application technology
Abstract/Summary:
In recent years, Deep Reinforcement Learning (DRL) has made great progress and achieved remarkable success in many fields, such as competitive games (Atari 2600, Go, etc.), robot navigation, and control tasks. Given a control problem that can be formulated as a finite Markov decision process (MDP), DRL aims to learn, through extensive trial and error, an optimal policy under which the agent obtains the highest cumulative reward. Generally speaking, model-free deep reinforcement learning includes value-function-based methods and policy-based methods: the former learns a policy indirectly by learning the optimal action-value function, while the latter searches for the optimal policy directly in a parameterized policy space. The two can be combined into the Actor-Critic framework. Based on the Actor-Critic framework, this paper conducts in-depth research on proximal policy optimization, the balance of exploration and exploitation, and value-function estimation error. The main work is as follows:

(1) A method to solve negative optimization in proximal policy optimization is proposed. Proximal policy optimization suffers from a negative optimization problem: although its minimum operation can alleviate the problem, the "escape speed" is too slow, making it difficult for the algorithm to escape from a wrong optimization direction within a finite number of updates. To address this, the paper proposes a fast proximal policy optimization algorithm, which uses two acceleration techniques, linear-pulling and quadratic-pulling, to increase the gradient weight of samples suffering from negative optimization and to correct the fused gradient direction toward a reasonable optimization direction. Extensive experiments on classic discrete control tasks and on MuJoCo-based continuous control tasks demonstrate the effectiveness of the proposed fast proximal policy optimization algorithm.
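A minimal sketch of this idea is given below (PyTorch), assuming the standard PPO clipped surrogate; the `pull_coef` weighting used to up-weight negatively optimized samples is only an illustrative stand-in, since the exact linear-pulling and quadratic-pulling formulas are not spelled out in this abstract.

```python
import torch

def fast_ppo_surrogate(ratio, advantage, clip_eps=0.2, pull_coef=1.0):
    """ratio = pi_new(a|s) / pi_old(a|s); advantage = estimated A(s, a)."""
    # Standard PPO clipped surrogate (to be maximized).
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantage, clipped_ratio * advantage)

    # "Negative optimization": the ratio has already drifted past the clip
    # boundary in the direction that makes the objective worse.
    neg_opt = ((ratio > 1.0 + clip_eps) & (advantage < 0)) | \
              ((ratio < 1.0 - clip_eps) & (advantage > 0))

    # Hypothetical "pulling" weight (detached so it acts as a pure sample
    # weight): the further the ratio is past the boundary, the larger the
    # gradient pulling it back toward a reasonable direction.
    pull = (1.0 + pull_coef * (ratio - clipped_ratio).abs()).detach()

    weighted = torch.where(neg_opt, pull * surrogate, surrogate)
    return weighted.mean()

# Usage: the policy loss to minimize is the negative surrogate.
ratio = torch.tensor([0.8, 1.0, 1.6], requires_grad=True)
advantage = torch.tensor([1.5, -0.5, -2.0])
loss = -fast_ppo_surrogate(ratio, advantage)
loss.backward()
```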
(2) A method for balancing exploration and exploitation is proposed. Balancing exploration and exploitation is a critical issue in policy gradient algorithms, and a reasonable balance can improve performance. This paper first increases the exploration of the policy network by adding an entropy term to the value function, then introduces distributional reinforcement learning into the policy gradient algorithm and uses its variance information to link intrinsic uncertainty to the risk tendency of the future policy. A hyperparameter is used to measure parameter uncertainty and, based on it, to determine the risk tendency of the future policy, thereby balancing exploration and exploitation. With this method, the algorithm achieves state-of-the-art performance on multiple MuJoCo continuous control tasks.

(3) A method to solve the overestimation of the value function in policy gradient algorithms is proposed. In policy gradient algorithms, accurate value-function estimation is essential for obtaining the optimal policy, but the value-function estimate often suffers from overestimation, which harms the learning of the optimal policy. We introduce distributional reinforcement learning into the policy gradient algorithm and propose a max-min estimator: it contains multiple estimators, each of which outputs multiple quantiles to characterize the return distribution, and taking the minimum value for each quantile alleviates the overestimation problem. This algorithm achieves state-of-the-art performance on several MuJoCo continuous control tasks.
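The following is a minimal sketch (PyTorch) of the max-min idea: an ensemble of quantile critics, with the element-wise minimum over critics taken for every quantile to obtain a pessimistic return distribution. The network sizes and the way the conservative estimate is consumed are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """One critic that outputs n_quantiles quantiles of the return."""
    def __init__(self, state_dim, action_dim, n_quantiles=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def pessimistic_quantiles(critics, state, action):
    # Stack critic outputs to (n_critics, batch, n_quantiles), then take the
    # element-wise minimum over the critic axis for every quantile.
    all_q = torch.stack([c(state, action) for c in critics], dim=0)
    return all_q.min(dim=0).values  # (batch, n_quantiles)

# Usage: the mean of the pessimistic quantiles gives a conservative Q-value
# that can replace the usual critic value in the policy update / TD target.
critics = [QuantileCritic(state_dim=8, action_dim=2) for _ in range(4)]
s, a = torch.randn(16, 8), torch.randn(16, 2)
q_quantiles = pessimistic_quantiles(critics, s, a)   # (16, 32)
conservative_q = q_quantiles.mean(dim=-1)             # (16,)
```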
Keywords/Search Tags: reinforcement learning, policy gradient, proximal policy optimization, exploration and exploitation, max-min