
Off-policy Policy Iteration Algorithm With State Distribution Ratio

Posted on: 2022-05-19 | Degree: Master | Type: Thesis
Country: China | Candidate: T Zhou | Full Text: PDF
GTID: 2518306479993139 | Subject: Statistics
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. An agent obtains immediate rewards by interacting with the environment, and its goal is to maximize the expected return. Reinforcement learning can be divided into on-policy learning and off-policy learning according to whether the behavior policy that collects the samples is the same as the target policy. Off-policy methods are more general and can be applied to a wide range of practical problems. In recent years, most scholars have focused on the off-policy policy evaluation problem, which is the foundation of off-policy policy learning: when studying the off-policy optimization problem, off-policy evaluation is a key step in policy improvement.

This thesis studies the off-policy optimization problem for Markov decision processes, that is, learning a new policy with good performance from a fixed set of samples collected by a known behavior policy. To solve this problem, this thesis proposes a new off-policy iterative algorithm. The algorithm follows the idea of generalized policy iteration (GPI), which alternates two steps: policy evaluation and policy improvement. In the policy evaluation step, building on the off-policy evaluation method of Liu et al. (2018) [20], the value function of the target policy is estimated by correcting the mismatch between the state distributions induced by the target policy and the behavior policy; namely, the density ratio of the two stationary state distributions replaces the cumulative importance sampling ratio over the trajectory space. This avoids the variance that grows exponentially with the trajectory horizon in earlier importance sampling methods, so the algorithm can be applied to Markov decision problems with long horizons.

To verify the empirical properties of the algorithm, we ran simulations and compared it with previous off-policy learning algorithms that do not correct for the difference in state distributions. The results show that the proposed algorithm can learn a new policy with good performance from finite samples collected by a known behavior policy, and that its performance is more stable than that of the previous algorithm without the state-distribution correction.
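To make the evaluation step concrete, the following is a minimal tabular sketch (not the thesis implementation; all function and variable names are illustrative assumptions) of how a stationary-state-distribution density ratio w(s) ≈ d_π(s)/d_μ(s) can be used to estimate the target policy's performance from a fixed batch of transitions, together with a greedy improvement step in the spirit of GPI.

```python
# Illustrative sketch only: off-policy evaluation of a target policy from a
# fixed batch collected under a behavior policy, weighting each transition by
# w(s) * pi(a|s) / mu(a|s) instead of a cumulative per-trajectory
# importance-sampling ratio. Names and signatures are assumptions, not the
# thesis code.
import numpy as np

def estimate_return(batch, w, target_pi, behavior_mu):
    """Estimate the target policy's average reward from off-policy data.

    batch       : list of (s, a, r) transitions gathered under behavior_mu
    w           : array, w[s] approximates d_pi(s) / d_mu(s)
    target_pi   : array, target_pi[s, a] = pi(a | s)
    behavior_mu : array, behavior_mu[s, a] = mu(a | s)
    """
    ratios = np.array([w[s] * target_pi[s, a] / behavior_mu[s, a]
                       for s, a, _ in batch])
    rewards = np.array([r for _, _, r in batch])
    # Self-normalized weighting keeps the estimate stable with finite samples.
    return np.sum(ratios * rewards) / np.sum(ratios)

def greedy_improvement(q_values):
    """Policy-improvement step of GPI: act greedily w.r.t. the estimated Q."""
    n_states, n_actions = q_values.shape
    new_pi = np.zeros_like(q_values)
    new_pi[np.arange(n_states), q_values.argmax(axis=1)] = 1.0
    return new_pi
```

In practice the ratio w is not known and must itself be estimated from the batch, for example by the procedure of Liu et al. (2018) [20]; GPI then alternates estimating the value of the current target policy under this corrected weighting with greedy improvement until the policy stabilizes.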
Keywords/Search Tags: off-policy, state distribution ratio, GPI