
Research On Value Function Overestimation For Deep Q-Network

Posted on: 2021-02-07    Degree: Master    Type: Thesis
Country: China    Candidate: J J Wu    Full Text: PDF
GTID: 2428330605976507    Subject: Computer Science and Technology
Abstract/Summary:
In the field of artificial intelligence, deep reinforcement learning has achieved exceptional results in recent years. In tasks with large-scale continuous state spaces, deep reinforcement learning methods successfully handle the "curse of dimensionality" problem that afflicts traditional reinforcement learning methods. Deep Q-Network, however, suffers from overestimation of action values, which degrades algorithm performance. Double Deep Q-Network separates action selection from action evaluation and thereby alleviates the overestimation problem of Deep Q-Network to a certain extent, but it in turn tends to underestimate action values. This thesis is devoted to improving the accuracy of target value calculation. To address the overestimation of Deep Q-Network and the underestimation of Double Deep Q-Network, the following three studies are conducted:

i. The weighted double estimator is introduced into Deep Q-Network, and a weight is used to strike a balance between Deep Q-Network and Double Deep Q-Network (see the first sketch below). To further improve the accuracy of the target value calculation, the averaging idea of Averaged-DQN is refined, and an averaging method based on the temporal difference error is proposed to compute the target value. Finally, the weighted double estimator is combined with the improved averaging method, yielding the Averaged Weighted Double Deep Q-Network algorithm. Several sets of experiments show that its performance on the Atari 2600 platform is greatly improved.

ii. To further alleviate the overestimation inherent in Deep Q-Network and the underestimation inherent in Double Deep Q-Network, on-policy methods from reinforcement learning are used in place of the Q-learning algorithm to compute the values of the two networks, and a weight is used to calculate the target value. Unlike Q-learning, on-policy methods have stronger convergence guarantees and also offer potential advantages for online updates. To improve the accuracy of the target value calculation, a Weighted Double Deep Q(σ)-Network is proposed. In this algorithm, the component values of the evaluation network and the target network are each computed as a linear combination of the Sarsa and Expected Sarsa estimates (see the second sketch below), and the component values are then combined with a weight to obtain the final target value. Experiments on multiple Atari 2600 games demonstrate the algorithm's superiority and stability.

iii. Although on-policy methods can ensure convergence, they may explore insufficiently. Off-policy methods use a behavior policy that differs from the target policy, enabling the agent to discover more important state information. Therefore, to further alleviate the overestimation problem of Deep Q-Network, a Double Deep Q-Network algorithm combining off-policy and on-policy algorithms is proposed. This algorithm uses an off-policy method to compute the component value of the evaluation network and an on-policy method to compute the component value of the target network; the target value is then obtained by weighting the two component values. The performance of this algorithm is verified on a series of video game tasks.

This thesis focuses on value-function-based deep reinforcement learning algorithms. It studies the overestimation problem of Deep Q-Network, aiming to improve the accuracy of target value calculation and the performance of the algorithms.
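The first sketch below is a minimal PyTorch illustration, not the thesis code, of the weighted double target idea in study i: the DQN target and the Double DQN target are blended by a weight. The names online_net, target_net, and beta, as well as the tensor layout, are illustrative assumptions.

```python
# Minimal sketch: weighted double estimator target (study i).
# online_net, target_net, beta, and tensor shapes are illustrative assumptions.
import torch

def weighted_double_target(reward, next_state, done, online_net, target_net,
                           gamma=0.99, beta=0.5):
    """reward, done: float tensors of shape [batch]; next_state: [batch, ...]."""
    with torch.no_grad():
        q_next_target = target_net(next_state)   # target network values Q'(s', .)
        q_next_online = online_net(next_state)   # online network values  Q(s', .)

        # DQN component: max over the target network (tends to overestimate).
        dqn_value = q_next_target.max(dim=1).values

        # Double DQN component: select with the online network, evaluate with
        # the target network (tends to underestimate).
        greedy_a = q_next_online.argmax(dim=1, keepdim=True)
        ddqn_value = q_next_target.gather(1, greedy_a).squeeze(1)

        # Weighted double estimator: beta balances the two opposing biases.
        blended = beta * dqn_value + (1.0 - beta) * ddqn_value
        return reward + gamma * (1.0 - done) * blended
```

The thesis additionally averages such targets using a temporal-difference-error-based scheme before forming the final target; that averaging step is not shown here.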
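The second sketch, also a hedged illustration rather than the thesis implementation, shows the on-policy component used in study ii: a linear combination of the Sarsa estimate and the Expected Sarsa estimate under an ε-greedy policy, in the spirit of Q(σ). The parameters sigma and epsilon and the sampled next_action are assumptions made for illustration.

```python
# Minimal sketch: Sarsa / Expected Sarsa linear combination (study ii).
# sigma, epsilon, and the sampled next_action are illustrative assumptions.
import torch

def q_sigma_backup(q_next, next_action, sigma=0.5, epsilon=0.05):
    """q_next: [batch, num_actions] action values at s';
    next_action: [batch] long tensor of actions sampled by the behavior policy."""
    num_actions = q_next.size(1)

    # Sarsa component: value of the action actually sampled at s'.
    sarsa_value = q_next.gather(1, next_action.unsqueeze(1)).squeeze(1)

    # Expected Sarsa component: expectation under an epsilon-greedy policy.
    greedy = q_next.argmax(dim=1, keepdim=True)
    probs = torch.full_like(q_next, epsilon / num_actions)
    probs.scatter_(1, greedy, 1.0 - epsilon + epsilon / num_actions)
    expected_value = (probs * q_next).sum(dim=1)

    # Linear combination of the two on-policy estimates.
    return sigma * sarsa_value + (1.0 - sigma) * expected_value
```

In the thesis, such component values are computed for both the evaluation network and the target network and then combined with a weight to form the final target; study iii replaces the evaluation-network component with an off-policy estimate.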
Keywords/Search Tags:deep reinforcement learning, target value estimation, temporal difference error, double estimator, off-policy