
Research On Deep Reinforcement Learning Based On Cumulative Error Correction

Posted on: 2023-12-13    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Gu    Full Text: PDF
GTID: 1528307055457364    Subject: Control theory and control engineering
Abstract/Summary:
Deep reinforcement learning combines deep learning with reinforcement learning. It has strong perception and decision-making abilities and is expected to advance the development of general artificial intelligence. However, reinforcement learning evolved from dynamic programming and requires many iterations to learn an optimal policy, while deep neural networks have poor interpretability and their learning performance depends strongly on the samples. Consequently, if exploration is insufficient and too little experience data is collected, the value estimates of deep reinforcement learning become biased and the data distribution shifts. Under the combined influence of estimation bias and experience data distribution shift, errors accumulate during training, which seriously degrades the stability of policy updates and the learning efficiency of deep reinforcement learning algorithms. This dissertation therefore focuses on how to reduce the cumulative error of deep reinforcement learning. The main contents are as follows:

(1) Value iteration with function approximation accumulates Bellman residuals and may be updated in a direction away from the optimal value function. A feasible way to restrain the error accumulated during training is to reduce the single-step update error. Based on this idea, the value function of the previous iteration is introduced into the optimization of the current value function, and the contribution of the two to a single iteration is adjusted through a transform matrix. Using the Bellman operator, error analysis, and mathematical induction, the transform matrix that minimizes the single-step update error is derived. Based on this transform matrix, an approximate policy acceleration (APA) value iteration method is proposed, and it is proved that the method can use a more aggressive learning rate while still guaranteeing convergence. Finally, by combining the proposed value iteration with off-policy deep reinforcement learning algorithms, the APA-based deep Q network, APA-based double deep Q network, and APA-based deep deterministic policy gradient are proposed (a sketch of the target-mixing idea appears below).

(2) In the Actor-Critic architecture, the actor network optimizes the policy based on the output of the critic network, but the policy does not directly participate in the iterative update of the value function. The critic network therefore responds to changes in the actor network with a lag, which further destabilizes the policy. Moreover, since the target value of the on-policy critic network is not the expected discounted return induced by the behavior policy, errors accumulate during training. To address this, the policy is introduced into the update of the value function, and a policy-based expected (PBE) discounted return and value iteration are proposed. By analyzing expected SARSA and the expected discounted return of trajectories induced by the behavior policy, it is proved that PBE value iteration effectively reduces the estimation error of on-policy value iteration. Then, based on theoretical results on the discount factor and monotonic policy improvement, a policy update method with a clipped discount factor is proposed to keep the policy update an unbiased trust-region estimate when the PBE discounted return is used. Finally, an Actor-Critic architecture with policy feedback is designed, and proximal policy optimization with policy feedback is further proposed.
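To make the target-mixing idea behind (1) concrete, the following is only a minimal sketch under simplifying assumptions, not the dissertation's implementation: the bootstrap target of a DQN-style update blends the current target network with the value function kept from the previous iteration. The analytically derived transform matrix is replaced by a hypothetical scalar weight alpha_apa, and the modules q_target and q_prev are assumed purely for illustration.

```python
# Minimal sketch of an APA-style blended bootstrap target for a DQN-like update.
# The transform matrix derived in the dissertation is replaced here by the
# hypothetical scalar weight `alpha_apa`; `q_target` and `q_prev` are assumed
# Q-networks holding the current and previous iterates, respectively.
import torch

def apa_td_target(reward, next_state, done, q_target, q_prev,
                  gamma=0.99, alpha_apa=0.7):
    """Blend the current and previous value iterates when forming the
    bootstrap target, so a single update step deviates less from the
    last iterate (smaller single-step update error)."""
    with torch.no_grad():
        v_curr = q_target(next_state).max(dim=1).values   # current iterate
        v_prev = q_prev(next_state).max(dim=1).values     # previous iterate
        v_mix = alpha_apa * v_curr + (1.0 - alpha_apa) * v_prev
        return reward + gamma * (1.0 - done) * v_mix
```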
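In the same spirit, a plausible sketch of the policy-based expected (PBE) target from (2), following expected SARSA: the critic target averages Q-values under the current policy at the next state instead of bootstrapping from a single sampled action. Only the discrete-action case is shown; q_net and policy are hypothetical modules, and the clipped discount factor and full PBE discounted return are not reproduced.

```python
# Illustrative expected-SARSA-style target that feeds the policy back into
# the value update: V(s') is taken as the expectation of Q(s', a) under the
# current policy's action distribution (discrete actions assumed).
import torch

def pbe_td_target(reward, next_state, done, q_net, policy, gamma=0.99):
    """Policy-based expected target: E_{a~pi}[Q(s', a)] replaces the
    sampled-action bootstrap, coupling the critic update to the policy."""
    with torch.no_grad():
        q_next = q_net(next_state)              # Q-values, shape [B, |A|]
        pi_next = policy(next_state)            # action probabilities, [B, |A|]
        v_next = (pi_next * q_next).sum(dim=1)  # expectation over the policy
        return reward + gamma * (1.0 - done) * v_next
```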
(3) Because it is induced by the same behavior policy, the experience data collected by the agent are highly correlated, resulting in low sample efficiency. To address this, exploiting the connection between martingales and reinforcement learning, a sub-martingale criterion is proposed to judge whether experience data are conducive to policy optimization (an illustrative check is sketched below). By analyzing the on-policy value iteration process, it is found that underestimating the state value increases the estimation error and lowers the learning efficiency of the algorithm. To reduce the cumulative error caused by this underestimation, a new method called Advanced Value Iteration (AVI) is proposed. By analyzing the estimation error of the on-policy value function, it is proved that applying AVI to experience data that satisfy the sub-martingale criterion is more beneficial to training. Furthermore, an anti-martingale reinforcement learning framework is established and anti-martingale proximal policy optimization is proposed.

(4) In offline reinforcement learning, a fixed offline experience buffer avoids the system risk associated with random exploration. However, outliers and abnormal samples are rare in the offline experience buffer, so the estimates of the corresponding state values are biased, which increases the error in the offline update gradient. Prioritized experience replay is therefore introduced into offline reinforcement learning, and an offline prioritized experience (OPE) sampling model is proposed to reduce the value function estimation error. In addition, throughout training the offline update gradient is affected by experience data that are not conducive to policy optimization. Based on a theoretical analysis of the value iteration process and martingales, a martingale-based offline prioritized experience (MOPE) sampling model is therefore proposed; it reduces the cumulative error caused by repeatedly sampling experience data that are not conducive to policy optimization. Furthermore, the two sampling models are combined with batch-constrained Q-learning (BCQ) to obtain prioritized BCQ and martingale-based prioritized BCQ.

Simulation results on Atari games, robot control tasks, and an autonomous driving simulator show that the proposed algorithms make the deep reinforcement learning algorithms in their respective areas more sample-efficient. The dissertation has 22 figures, 7 tables, and 152 references.
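Under one plausible reading only, the sub-martingale criterion in (3) can be illustrated as a per-transition test that the one-step bootstrap target does not fall below the current value estimate, mirroring the condition E[X_{t+1} | F_t] >= X_t; the dissertation's actual criterion and the AVI update are not reproduced here, and value_net is a hypothetical state-value network.

```python
# Illustrative sub-martingale-style filter over a batch of transitions:
# keep those whose one-step TD target is at least the current value
# estimate (non-negative TD error). This is one reading of "conducive to
# policy optimization", not the dissertation's exact criterion.
import torch

def submartingale_mask(states, rewards, next_states, dones,
                       value_net, gamma=0.99):
    """Return a boolean mask selecting transitions whose bootstrap target
    does not fall below the current state-value estimate."""
    with torch.no_grad():
        v = value_net(states).squeeze(-1)
        v_next = value_net(next_states).squeeze(-1)
        target = rewards + gamma * (1.0 - dones) * v_next
        return target >= v
```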
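Finally, a minimal sketch of priority-based sampling over a fixed offline buffer, in the direction of (4), assuming standard proportional prioritization by TD-error magnitude; the optional admissible mask only hints at the martingale-based filtering of MOPE, and none of the names below come from the dissertation's sampling model.

```python
# Illustrative prioritized sampling from an offline buffer: indices are
# drawn with probability proportional to |TD error|^alpha, optionally
# restricted to transitions flagged as helpful (e.g. by the sub-martingale
# mask sketched above). All names here are assumptions for illustration.
import torch

def sample_offline_batch(td_errors, batch_size, alpha=0.6, admissible=None):
    """Sample buffer indices with probability proportional to |TD error|^alpha."""
    priorities = (td_errors.abs() + 1e-6) ** alpha
    if admissible is not None:
        priorities = priorities * admissible.float()   # drop unhelpful transitions
    probs = priorities / priorities.sum()
    return torch.multinomial(probs, batch_size, replacement=True)
```

A mask from the previous sketch could be passed as admissible, combining prioritization with the martingale-style filter in roughly the spirit of martingale-based prioritized BCQ.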
Keywords/Search Tags:deep reinforcement learning, cumulative error, value iteration, policy gradient, martingale, sampling model