
Research On Accelerating The Convergence Of Off-policy Temporal Difference Learning

Posted on: 2021-02-13    Degree: Master    Type: Thesis
Country: China    Candidate: B He    Full Text: PDF
GTID: 2428330605974892    Subject: Computer technology
Abstract/Summary:
Reinforcement learning is a branch of machine learning that finds optimal policies by maximizing the expected cumulative reward obtained by an agent. The exploration-exploitation dilemma is often encountered in the search for optimal policies, and off-policy methods were introduced in reinforcement learning to address it. However, most off-policy temporal difference (TD) methods ignore the discrepancy between the behavior policy and the target policy, which can lead to divergence, learning incorrect policies, or slow learning. To address these problems, this thesis proposes three off-policy TD methods based on state distribution correction, summarized in the following three parts:

i. When solving the off-policy evaluation problem, off-policy TD methods are known to diverge under function approximation, and the direct cause is the discrepancy between the behavior policy and the target policy. To address this, a state distribution correction method combined with TD(λ) is proposed, based on the fact that an ergodic Markov chain has a unique stationary distribution. Experiments demonstrate that the proposed off-policy TD method resolves the problem caused by the mismatch between the two state distributions.

ii. Importance sampling is an essential component of off-policy TD methods, but its high variance is a common problem that slows convergence. To address this, a state distribution correction method using experience replay is proposed. By avoiding the direct use of the importance sampling ratio in value function updates, the method alleviates the high-variance problem while also correcting the discrepancy between state distributions.

iii. To address the divergence and slow convergence of off-policy TD methods, a method that directly corrects the state distribution using the state distribution ratio is proposed, with lower time and space complexity than the previously proposed methods. Finally, to eliminate the influence of an artificially constructed environment on the fairness of the experiments, randomly constructed MDPs are used. The experimental results show that the method not only converges but also converges faster than commonly used off-policy methods.
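To make the underlying idea concrete, the following is a minimal Python sketch, not the thesis' exact algorithm: tabular off-policy TD(0) policy evaluation on a randomly constructed MDP, where each update is weighted by the per-step importance sampling ratio and by a state distribution ratio d_pi(s)/d_mu(s). The helper stationary_distribution and the oracle d_ratio are illustrative assumptions introduced here; the thesis itself proposes ways to perform the state distribution correction without such an oracle.

```python
import numpy as np

# Minimal sketch (assumptions noted above): tabular off-policy TD(0) evaluation
# on a small randomly constructed MDP. The per-step ratio pi(a|s)/mu(a|s)
# corrects for the action mismatch; the state distribution ratio
# d_pi(s)/d_mu(s) reweights each update toward the target policy's
# stationary state distribution.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1

# Randomly constructed MDP: transition kernel P[s, a, s'] and rewards R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

# Behavior policy mu (uniform) and a fixed random target policy pi
mu = np.full((n_states, n_actions), 1.0 / n_actions)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def stationary_distribution(policy):
    """Stationary state distribution of the Markov chain induced by a policy."""
    P_policy = np.einsum('sa,sap->sp', policy, P)
    evals, evecs = np.linalg.eig(P_policy.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return d / d.sum()

# Hypothetical oracle ratio d_pi(s)/d_mu(s); in practice this quantity must be
# estimated or avoided, which is the subject of the proposed methods.
d_ratio = stationary_distribution(pi) / stationary_distribution(mu)

V = np.zeros(n_states)
s = 0
for _ in range(50_000):
    a = rng.choice(n_actions, p=mu[s])
    s_next = rng.choice(n_states, p=P[s, a])
    rho = pi[s, a] / mu[s, a]                      # importance sampling ratio
    td_error = R[s, a] + gamma * V[s_next] - V[s]
    V[s] += alpha * d_ratio[s] * rho * td_error    # distribution-corrected update
    s = s_next

print("Estimated V under the target policy:", np.round(V, 3))
```

In the tabular setting this reweighting acts as a per-state step-size scaling; the motivation described in the abstract is the function approximation setting, where the mismatch between the behavior and target state distributions is what causes divergence.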
Keywords/Search Tags:reinforcement learning, off-policy TD methods, stationary distribution, accelerating convergence