
On the convergence of model-free policy iteration algorithms for reinforcement learning: Stochastic approximation under discontinuous mean dynamics

Posted on: 2001-10-03
Degree: Ph.D.
Type: Thesis
University: University of Colorado at Boulder
Candidate: Williams, John Kevin
Full Text: PDF
GTID: 2468390014951879
Subject: Mathematics
Abstract/Summary:
This thesis is motivated by an unsolved problem in the theory of stochastic dynamic programming, or reinforcement learning: determining whether certain algorithms are guaranteed to converge to an optimal policy (control) for a Markov decision process with a finite number of states and actions. These algorithms, which include Sarsa(λ) and Monte Carlo ES, are (1) model-free, operating directly on experience obtained from the environment rather than performing system identification; (2) on-policy, learning only about the policy currently being executed; and (3) multi-step, using sequences of experiences for their updates. The convergence of several other reinforcement learning algorithms, including Q-learning, TD(λ), and actor-critic methods, has been established using stochastic approximation techniques. However, standard stochastic approximation theory is not directly applicable to the model-free, on-policy, multi-step algorithms mentioned above, because their mean iterate dynamics contain jump discontinuities under greedy or rank-based policy selection.

Simplifying the problem, we analyze model-free policy iteration algorithms, including incremental and asynchronous variants, which include Monte Carlo ES as a special case. Using the theory of dynamic programming and Markov decision processes, we establish the mean iterate dynamics of these algorithms under greedy policy selection, characterize their discontinuities, and prove their convergence in the absence of evaluation noise. For the more realistic case in which noise is present, we propose a novel modification of the algorithms that introduces an arbitrarily small "margin for error," eliminating the possibility of infinitesimal noise causing erroneous policy changes. Using stochastic approximation, probability, and martingale theory, we establish bounds on the deviation caused by evaluation noise or unbalanced component updates, and provide a novel criterion for convergence under diminishing stepsizes. Two major convergence theorems follow: a "convergence in distribution" result for modified fixed-stepsize, synchronous incremental policy iteration, and the almost-sure (a.s.) convergence of modified Monte Carlo ES under diminishing-stepsize, asynchronous updates. We generalize these analyses and convergence results to the case of stochastic, rank-based policy selection. Finally, we point out characteristics of the mean iterate dynamics of Sarsa(λ) that prevent us from immediately applying these results to demonstrate its convergence.
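The following is a minimal, hypothetical Python sketch (not taken from the thesis) of the kind of algorithm discussed above: tabular, model-free policy iteration in the style of Monte Carlo ES with exploring starts, in which the greedy improvement step switches to a new action only when its estimated value exceeds the current action's value by a small margin, so that infinitesimal evaluation noise cannot trigger spurious policy changes. The environment interface (env.random_state(), env.step(state, action) returning (next_state, reward, done)) and all names are assumptions for illustration only.

```python
import random
from collections import defaultdict

def monte_carlo_es_sketch(env, actions, num_episodes=10000, epsilon_margin=1e-3):
    """Illustrative tabular Monte Carlo ES-style policy iteration.

    Q-values are estimated from complete episode returns (undiscounted for
    brevity); the policy is improved greedily, but only when a candidate
    action's value exceeds the current action's value by more than
    `epsilon_margin` (the "margin for error").
    """
    Q = defaultdict(float)       # state-action value estimates
    counts = defaultdict(int)    # visit counts for incremental averaging
    policy = {}                  # greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring starts: begin each episode from a random state-action pair.
        state = env.random_state()
        action = random.choice(actions)
        episode = []
        done = False
        while not done:
            next_state, reward, done = env.step(state, action)
            episode.append((state, action, reward))
            state = next_state
            action = policy.get(state, random.choice(actions))

        # Policy evaluation: every-visit Monte Carlo return.
        G = 0.0
        for s, a, r in reversed(episode):
            G += r
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]

            # Policy improvement with a margin for error: only switch actions
            # if the estimated improvement exceeds epsilon_margin.
            current = policy.get(s, a)
            best = max(actions, key=lambda b: Q[(s, b)])
            if Q[(s, best)] > Q[(s, current)] + epsilon_margin:
                policy[s] = best
            else:
                policy[s] = current

    return policy, Q
```

The margin plays the role described in the abstract: without it, arbitrarily small noise in the Q-value estimates could flip the greedy argmax and cause erroneous policy changes; with it, only improvements larger than the margin are acted upon.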
Keywords/Search Tags: Convergence, Reinforcement learning, Stochastic, Policy, Algorithms, Mean iterate dynamics, Theory