
Optimization On Deep Reinforcement Learning Based On Policy Gradient

Posted on: 2022-04-01  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Zhong  Full Text: PDF
GTID: 2518306524980859  Subject: Software engineering
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. It learns action strategies by imitating the trial-and-error learning of living organisms: unlike traditional learning methods, the agent is not told directly which actions to take; instead it receives immediate rewards for the actions it does take and learns a policy that maximizes the cumulative reward. Through the two elements of "trial and error" and "delayed reward", reinforcement learning can handle many highly interactive decision-making problems that are difficult for traditional machine learning. Deep reinforcement learning applies deep neural networks on top of reinforcement learning, overcoming the limitation that traditional reinforcement learning only applies to problems with small action and sample spaces. Among these methods, deep reinforcement learning algorithms based on the deterministic policy gradient handle continuous action spaces; the best known is the DDPG algorithm. However, DDPG overestimates the cumulative return of the actions it takes, and its policy fluctuates excessively during training. The TD3 algorithm improves on DDPG, but it still estimates the cumulative return of the chosen action inaccurately, and its policy also fluctuates substantially during training.

To address these two problems, this dissertation proposes two improved algorithms. For the inaccurate estimation of the cumulative return, it proposes the ?-TD3 algorithm. TD3 takes the minimum of two Q networks to curb overestimation, but this approach can lead to underestimation. In ?-TD3, this dissertation defines an evaluation index ?, based on the TD error, that judges whether the current estimate is overestimated or underestimated; the two Q values are then weighted by ? to form the estimate. ? adjusts its own value adaptively from the rewards observed during training. For the excessive policy fluctuation, this dissertation proposes TD3++, a policy-coordination algorithm based on TD3. The algorithm uses two policy networks that collaboratively select the action with the higher expected reward, and it adds dropout to the policy network. These two improvements reduce the policy fluctuation of TD3 and increase the stability of the learned policy.

This dissertation conducts simulation experiments with ?-TD3 and TD3++ on MuJoCo continuous motion control tasks. The experimental results show that both improved algorithms achieve good performance.
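The two ideas above can be illustrated with a minimal sketch using scalar Q estimates. The abstract does not give the exact weighting rule or the adaptation of the evaluation index, so the convex combination below, the weight name `beta`, and the function names `weighted_target` and `coordinated_action` are illustrative assumptions, not the dissertation's actual formulas.

```python
def td3_target(r, q1_next, q2_next, gamma=0.99):
    """Standard TD3 target: clipped double-Q, take the smaller critic."""
    return r + gamma * min(q1_next, q2_next)

def weighted_target(r, q1_next, q2_next, beta, gamma=0.99):
    """Hypothetical weighted variant of the TD3 target: mix the two
    critics instead of always taking the minimum. beta = 1 recovers
    TD3; a smaller beta shifts weight toward the larger critic when
    underestimation is suspected. The TD-error-based adaptive update
    of beta described in the abstract is omitted here."""
    q_min = min(q1_next, q2_next)
    q_max = max(q1_next, q2_next)
    return r + gamma * (beta * q_min + (1.0 - beta) * q_max)

def coordinated_action(a1, a2, q_fn):
    """Sketch of two-policy coordination: each actor proposes an
    action, and the one the critic scores higher is executed."""
    return a1 if q_fn(a1) >= q_fn(a2) else a2
```

With `beta = 1` the weighted target equals the standard TD3 target, so the variant can be read as a generalization that interpolates between the pessimistic (min) and optimistic (max) critic estimates.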
Keywords/Search Tags:Reinforcement Learning, Deep Reinforcement Learning, Policy Gradient, DDPG, TD3