
Off-Policy Temporal Difference Algorithm With Distribution Adaptability

Posted on: 2024-06-07    Degree: Master    Type: Thesis
Country: China    Candidate: M K Zhang    Full Text: PDF
GTID: 2568307067993879    Subject: Statistics
Abstract/Summary:
Reinforcement learning is currently an important research topic in machine learning. It studies how an agent, through continual trial-and-error interaction with its environment, seeks a policy that maximizes the expected cumulative reward. When the behavior policy that generates the sample data during learning is the same as the target policy being evaluated, the method is called on-policy learning; when the two differ, it is called off-policy learning. Compared with on-policy learning, off-policy learning applies to a wider range of practical problems and has become one of the research hotspots in the field. Among commonly used off-policy methods, researchers pay particular attention to off-policy policy evaluation, on which off-policy learning as a whole is built. The temporal difference algorithm is a common method for off-policy policy evaluation. When the state space is large, function approximation is usually required; however, in the off-policy setting, the temporal difference algorithm with function approximation can diverge. To address this problem, several temporal difference algorithms, such as the gradient temporal difference algorithm and the emphatic temporal difference algorithm, have been proposed, but the gradient temporal difference algorithm suffers from strict convergence conditions and slow convergence.

This paper analyzes the specific causes of the divergence of the temporal difference algorithm with function approximation in the off-policy setting and proposes an off-policy temporal difference algorithm with distribution adaptability, called the DAOPTD(λ) algorithm. By introducing the ratio of the stationary state distributions of the target policy and the behavior policy as a correction variable, the algorithm adjusts for the distribution difference between the two policies, so that the distribution of off-policy sample data is adapted to that of on-policy sample data. This resolves the divergence problem of the off-policy temporal difference algorithm with linear function approximation, and convergence of the algorithm is proved under relatively loose assumptions. In addition, under certain assumptions, a finite-time bound on the error of the algorithm is derived, further advancing the development of off-policy learning in reinforcement learning.

Finally, to verify the empirical performance of the algorithm, this paper studies the effects of the temperature parameter and the weight on the performance of the proposed algorithm through simulations on Baird's seven-state star counterexample, a commonly used experimental environment in reinforcement learning, and compares it with the gradient temporal difference algorithm and the emphatic temporal difference algorithm under the same conditions. The simulation results show that the proposed algorithm performs better.
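The abstract does not state the exact DAOPTD(λ) update rule, so the following Python sketch only illustrates the general mechanism it describes: reweighting off-policy samples with an estimate of the stationary state distribution ratio inside a TD(λ) update with linear function approximation. The function name, step sizes, and the way the ratio enters the eligibility trace are illustrative assumptions, not the thesis's algorithm.

import numpy as np

def distribution_corrected_td_lambda_step(w, z, phi_s, phi_s_next, reward,
                                          rho_is, d_ratio,
                                          gamma=0.99, lam=0.8, alpha=0.01):
    # One off-policy TD(lambda) update with linear value estimates phi(s)^T w.
    #   rho_is  : per-step importance ratio pi(a|s) / mu(a|s)
    #   d_ratio : estimate of the stationary state distribution ratio
    #             d_pi(s) / d_mu(s) used to reweight the sample
    # (How the thesis combines these quantities is an assumption here.)
    # The distribution ratio reweights the current feature vector so that
    # states are emphasized as if drawn from the target policy's stationary
    # distribution rather than the behavior policy's.
    z = gamma * lam * rho_is * z + d_ratio * phi_s
    # Standard TD error under linear function approximation.
    delta = reward + gamma * (phi_s_next @ w) - (phi_s @ w)
    w = w + alpha * delta * z
    return w, z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_features = 8
    w = np.zeros(n_features)
    z = np.zeros(n_features)
    # Toy transitions with random features, rewards, and hypothetical ratios,
    # just to show how the update is called.
    for _ in range(1000):
        phi_s = rng.normal(size=n_features)
        phi_s_next = rng.normal(size=n_features)
        reward = rng.normal()
        rho_is = rng.uniform(0.5, 1.5)   # importance sampling ratio
        d_ratio = rng.uniform(0.5, 1.5)  # stationary distribution ratio estimate
        w, z = distribution_corrected_td_lambda_step(w, z, phi_s, phi_s_next,
                                                     reward, rho_is, d_ratio)
    print("learned weights:", w)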
Keywords/Search Tags: Off-policy learning, Function approximation, Temporal difference algorithm, Stationary state distribution ratio