
Off-Policy Temporal Difference Algorithm With Distribution Adaptability

Posted on: 2024-06-07    Degree: Master    Type: Thesis
Country: China    Candidate: M K Zhang    Full Text: PDF
GTID: 2568307067993879    Subject: Statistics
Abstract/Summary:
Reinforcement learning is currently an important research topic in machine learning. It studies how an agent, through continual trial-and-error interaction with its environment, seeks a policy that maximizes the expected cumulative reward. When the behavior policy that generates the sample data during learning is the same as the target policy being evaluated, the method is called on-policy learning; when the two differ, it is called off-policy learning. Compared with on-policy learning, off-policy learning applies to a wider range of practical problems and has become one of the research hotspots in the field. Among commonly used off-policy methods, researchers pay particular attention to off-policy policy evaluation, on which off-policy learning as a whole is built. The temporal difference algorithm is a common method for off-policy policy evaluation. When the state space is large, function approximation is usually required; however, in the off-policy setting, the temporal difference algorithm with function approximation can diverge. To address this problem, several temporal difference algorithms, such as the gradient temporal difference algorithm and the emphatic temporal difference algorithm, have been proposed, but the gradient temporal difference algorithm suffers from strict convergence conditions and slow convergence.

This paper analyzes the specific causes of the divergence of the temporal difference algorithm with function approximation in the off-policy setting and proposes an off-policy temporal difference algorithm with distribution adaptability, called the DAOPTD(λ) algorithm. By introducing the ratio of the stationary state distributions of the target policy and the behavior policy as a correction variable, the algorithm adjusts for the distribution difference between the two policies, so that the distribution of off-policy sample data is adapted to that of on-policy sample data. This resolves the divergence problem of the off-policy temporal difference algorithm with linear function approximation, and convergence of the algorithm is proved under relatively loose assumptions. In addition, under certain assumptions, a finite-time bound on the error of the algorithm is derived, further advancing the development of off-policy learning in reinforcement learning.

Finally, to verify the empirical performance of the algorithm, this paper studies the effects of the temperature parameter and the weight on the performance of the proposed algorithm through simulations on Baird's seven-state star counterexample, a commonly used experimental environment in reinforcement learning, and compares it with the gradient temporal difference algorithm and the emphatic temporal difference algorithm under the same conditions. The simulation results show that the proposed algorithm performs better.
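The abstract does not state the exact DAOPTD(λ) update rule, so the following Python sketch only illustrates the general mechanism it describes: reweighting off-policy samples with an estimate of the stationary state distribution ratio inside a TD(λ) update with linear function approximation. The function name, step sizes, and the way the ratio enters the eligibility trace are illustrative assumptions, not the thesis's algorithm.

import numpy as np

def distribution_corrected_td_lambda_step(w, z, phi_s, phi_s_next, reward,
                                          rho_is, d_ratio,
                                          gamma=0.99, lam=0.8, alpha=0.01):
    # One off-policy TD(lambda) update with linear value estimates phi(s)^T w.
    #   rho_is  : per-step importance ratio pi(a|s) / mu(a|s)
    #   d_ratio : estimate of the stationary state distribution ratio
    #             d_pi(s) / d_mu(s) used to reweight the sample
    # (How the thesis combines these quantities is an assumption here.)
    # The distribution ratio reweights the current feature vector so that
    # states are emphasized as if drawn from the target policy's stationary
    # distribution rather than the behavior policy's.
    z = gamma * lam * rho_is * z + d_ratio * phi_s
    # Standard TD error under linear function approximation.
    delta = reward + gamma * (phi_s_next @ w) - (phi_s @ w)
    w = w + alpha * delta * z
    return w, z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_features = 8
    w = np.zeros(n_features)
    z = np.zeros(n_features)
    # Toy transitions with random features, rewards, and hypothetical ratios,
    # just to show how the update is called.
    for _ in range(1000):
        phi_s = rng.normal(size=n_features)
        phi_s_next = rng.normal(size=n_features)
        reward = rng.normal()
        rho_is = rng.uniform(0.5, 1.5)   # importance sampling ratio
        d_ratio = rng.uniform(0.5, 1.5)  # stationary distribution ratio estimate
        w, z = distribution_corrected_td_lambda_step(w, z, phi_s, phi_s_next,
                                                     reward, rho_is, d_ratio)
    print("learned weights:", w)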
Keywords/Search Tags: Off-policy learning, Function approximation, Temporal difference algorithm, Stationary state distribution ratio