
Research On Accelerating The Convergence Of Off-policy Temporal Difference Learning

Posted on: 2021-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: B He    Full Text: PDF
GTID: 2428330605974892    Subject: Computer technology
Abstract/Summary:
Reinforcement learning is a branch of machine learning that finds optimal policies by maximizing the expected cumulative reward obtained by an agent. The exploration-exploitation dilemma is often encountered in the search for optimal policies, and off-policy methods were introduced in reinforcement learning to address it. However, most off-policy temporal difference (TD) methods ignore the discrepancy between the behavior policy and the target policy, which can lead to divergence, learning incorrect policies, or slow learning. To address these problems, this thesis proposes three off-policy TD methods based on state distribution correction, summarized in the following three parts:

i. When solving the off-policy evaluation problem, off-policy TD methods are known to diverge under function approximation, and the direct cause is the discrepancy between the behavior policy and the target policy. To address this, a state distribution correction method combined with TD(λ) is proposed, based on the fact that an ergodic Markov chain has a unique stationary distribution. Experiments demonstrate that the proposed off-policy TD method resolves the problem caused by the mismatch between the two state distributions.

ii. Importance sampling is an essential component of off-policy TD methods, but its high variance is a common problem that slows convergence. To address this, a state distribution correction method using experience replay is proposed. By avoiding the direct use of the importance sampling ratio in value function updates, the method alleviates the high-variance problem while also correcting the discrepancy between state distributions.

iii. To address the divergence and slow convergence of off-policy TD methods, a method that directly corrects the state distribution using the state distribution ratio is proposed, with lower time and space complexity than the previously proposed methods. Finally, to eliminate the influence of an artificially constructed environment on the fairness of the experiments, randomly constructed MDPs are used. The experimental results show that the method not only converges but also converges faster than commonly used off-policy methods.
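To make the underlying idea concrete, the following is a minimal Python sketch, not the thesis' exact algorithm: tabular off-policy TD(0) policy evaluation on a randomly constructed MDP, where each update is weighted by the per-step importance sampling ratio and by a state distribution ratio d_pi(s)/d_mu(s). The helper stationary_distribution and the oracle d_ratio are illustrative assumptions introduced here; the thesis itself proposes ways to perform the state distribution correction without such an oracle.

```python
import numpy as np

# Minimal sketch (assumptions noted above): tabular off-policy TD(0) evaluation
# on a small randomly constructed MDP. The per-step ratio pi(a|s)/mu(a|s)
# corrects for the action mismatch; the state distribution ratio
# d_pi(s)/d_mu(s) reweights each update toward the target policy's
# stationary state distribution.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1

# Randomly constructed MDP: transition kernel P[s, a, s'] and rewards R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

# Behavior policy mu (uniform) and a fixed random target policy pi
mu = np.full((n_states, n_actions), 1.0 / n_actions)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def stationary_distribution(policy):
    """Stationary state distribution of the Markov chain induced by a policy."""
    P_policy = np.einsum('sa,sap->sp', policy, P)
    evals, evecs = np.linalg.eig(P_policy.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return d / d.sum()

# Hypothetical oracle ratio d_pi(s)/d_mu(s); in practice this quantity must be
# estimated or avoided, which is the subject of the proposed methods.
d_ratio = stationary_distribution(pi) / stationary_distribution(mu)

V = np.zeros(n_states)
s = 0
for _ in range(50_000):
    a = rng.choice(n_actions, p=mu[s])
    s_next = rng.choice(n_states, p=P[s, a])
    rho = pi[s, a] / mu[s, a]                      # importance sampling ratio
    td_error = R[s, a] + gamma * V[s_next] - V[s]
    V[s] += alpha * d_ratio[s] * rho * td_error    # distribution-corrected update
    s = s_next

print("Estimated V under the target policy:", np.round(V, 3))
```

In the tabular setting this reweighting acts as a per-state step-size scaling; the motivation described in the abstract is the function approximation setting, where the mismatch between the behavior and target state distributions is what causes divergence.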
Keywords/Search Tags:reinforcement learning, off-policy TD methods, stationary distribution, accelerating convergence