
Off-policy Policy Iteration Algorithm With State Distribution Ratio

Posted on: 2022-05-19 | Degree: Master | Type: Thesis
Country: China | Candidate: T Zhou | Full Text: PDF
GTID: 2518306479993139 | Subject: Statistics
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. An agent obtains immediate rewards by interacting with the environment, and its goal is to maximize the expected return. Reinforcement learning can be divided into on-policy learning and off-policy learning according to whether the behavior policy that collects the samples is the same as the target policy. Off-policy methods are more general and can be applied to a wide range of practical problems. In recent years, most scholars have focused on the off-policy policy evaluation problem, which is the foundation of off-policy policy learning: when studying the off-policy optimization problem, off-policy evaluation is a key step in policy improvement.

This thesis studies the off-policy optimization problem for Markov decision processes, that is, learning a new policy with good performance from a fixed set of samples collected by a known behavior policy. To solve this problem, this thesis proposes a new off-policy iterative algorithm. The algorithm follows the idea of generalized policy iteration (GPI), which alternates two steps: policy evaluation and policy improvement. In the policy evaluation step, building on the off-policy evaluation method of Liu et al. (2018) [20], the value function of the target policy is estimated by correcting the mismatch between the state distributions induced by the target policy and the behavior policy; namely, the density ratio of the two stationary state distributions replaces the cumulative importance sampling ratio over the trajectory space. This avoids the variance that grows exponentially with the trajectory horizon in earlier importance sampling methods, so the algorithm can be applied to Markov decision problems with long horizons.

To verify the empirical properties of the algorithm, we ran simulations and compared it with previous off-policy learning algorithms that do not correct for the difference in state distributions. The results show that the proposed algorithm can learn a new policy with good performance from finite samples collected by a known behavior policy, and that its performance is more stable than that of the previous algorithm without the state-distribution correction.
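To make the evaluation step concrete, the following is a minimal tabular sketch (not the thesis implementation; all function and variable names are illustrative assumptions) of how a stationary-state-distribution density ratio w(s) ≈ d_π(s)/d_μ(s) can be used to estimate the target policy's performance from a fixed batch of transitions, together with a greedy improvement step in the spirit of GPI.

```python
# Illustrative sketch only: off-policy evaluation of a target policy from a
# fixed batch collected under a behavior policy, weighting each transition by
# w(s) * pi(a|s) / mu(a|s) instead of a cumulative per-trajectory
# importance-sampling ratio. Names and signatures are assumptions, not the
# thesis code.
import numpy as np

def estimate_return(batch, w, target_pi, behavior_mu):
    """Estimate the target policy's average reward from off-policy data.

    batch       : list of (s, a, r) transitions gathered under behavior_mu
    w           : array, w[s] approximates d_pi(s) / d_mu(s)
    target_pi   : array, target_pi[s, a] = pi(a | s)
    behavior_mu : array, behavior_mu[s, a] = mu(a | s)
    """
    ratios = np.array([w[s] * target_pi[s, a] / behavior_mu[s, a]
                       for s, a, _ in batch])
    rewards = np.array([r for _, _, r in batch])
    # Self-normalized weighting keeps the estimate stable with finite samples.
    return np.sum(ratios * rewards) / np.sum(ratios)

def greedy_improvement(q_values):
    """Policy-improvement step of GPI: act greedily w.r.t. the estimated Q."""
    n_states, n_actions = q_values.shape
    new_pi = np.zeros_like(q_values)
    new_pi[np.arange(n_states), q_values.argmax(axis=1)] = 1.0
    return new_pi
```

In practice the ratio w is not known and must itself be estimated from the batch, for example by the procedure of Liu et al. (2018) [20]; GPI then alternates estimating the value of the current target policy under this corrected weighting with greedy improvement until the policy stabilizes.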
Keywords/Search Tags: off-policy, state distribution ratio, GPI