
Optimization On Deep Reinforcement Learning Based On Policy Gradient

Posted on: 2022-04-01  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Zhong  Full Text: PDF
GTID: 2518306524980859  Subject: Software engineering
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. It learns action strategies by imitating the trial-and-error learning of living organisms: unlike traditional learning methods, the agent is not told directly which actions to take; instead it receives immediate rewards for the actions it does take and learns a policy that maximizes the cumulative reward. Through the two elements of "trial and error" and "delayed reward", reinforcement learning can handle many highly interactive decision-making problems that are difficult for traditional machine learning. Deep reinforcement learning applies deep neural networks on top of reinforcement learning, overcoming the limitation that traditional reinforcement learning only applies to problems with small action and sample spaces. Among these methods, deep reinforcement learning algorithms based on the deterministic policy gradient handle continuous action spaces; the best known is the DDPG algorithm. However, DDPG overestimates the cumulative return of the actions it takes, and its policy fluctuates excessively during training. The TD3 algorithm improves on DDPG, but it still estimates the cumulative return of the chosen action inaccurately, and its policy also fluctuates substantially during training.

To address these two problems, this dissertation proposes two improved algorithms. For the inaccurate estimation of the cumulative return, it proposes the ?-TD3 algorithm. TD3 takes the minimum of two Q networks to curb overestimation, but this approach can lead to underestimation. In ?-TD3, this dissertation defines an evaluation index ?, based on the TD error, that judges whether the current estimate is overestimated or underestimated; the two Q values are then weighted by ? to form the estimate. ? adjusts its own value adaptively from the rewards observed during training. For the excessive policy fluctuation, this dissertation proposes TD3++, a policy-coordination algorithm based on TD3. The algorithm uses two policy networks that collaboratively select the action with the higher expected reward, and it adds dropout to the policy network. These two improvements reduce the policy fluctuation of TD3 and increase the stability of the learned policy.

This dissertation conducts simulation experiments with ?-TD3 and TD3++ on MuJoCo continuous motion control tasks. The experimental results show that both improved algorithms achieve good performance.
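The two ideas above can be illustrated with a minimal sketch using scalar Q estimates. The abstract does not give the exact weighting rule or the adaptation of the evaluation index, so the convex combination below, the weight name `beta`, and the function names `weighted_target` and `coordinated_action` are illustrative assumptions, not the dissertation's actual formulas.

```python
def td3_target(r, q1_next, q2_next, gamma=0.99):
    """Standard TD3 target: clipped double-Q, take the smaller critic."""
    return r + gamma * min(q1_next, q2_next)

def weighted_target(r, q1_next, q2_next, beta, gamma=0.99):
    """Hypothetical weighted variant of the TD3 target: mix the two
    critics instead of always taking the minimum. beta = 1 recovers
    TD3; a smaller beta shifts weight toward the larger critic when
    underestimation is suspected. The TD-error-based adaptive update
    of beta described in the abstract is omitted here."""
    q_min = min(q1_next, q2_next)
    q_max = max(q1_next, q2_next)
    return r + gamma * (beta * q_min + (1.0 - beta) * q_max)

def coordinated_action(a1, a2, q_fn):
    """Sketch of two-policy coordination: each actor proposes an
    action, and the one the critic scores higher is executed."""
    return a1 if q_fn(a1) >= q_fn(a2) else a2
```

With `beta = 1` the weighted target equals the standard TD3 target, so the variant can be read as a generalization that interpolates between the pessimistic (min) and optimistic (max) critic estimates.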
Keywords/Search Tags:Reinforcement Learning, Deep Reinforcement Learning, Policy Gradient, DDPG, TD3