
On the convergence of model-free policy iteration algorithms for reinforcement learning: Stochastic approximation under discontinuous mean dynamics

Posted on: 2001-10-03
Degree: Ph.D.
Type: Thesis
University: University of Colorado at Boulder
Candidate: Williams, John Kevin
Full Text: PDF
GTID: 2468390014951879
Subject: Mathematics
Abstract/Summary:
This thesis is motivated by an unsolved problem in the theory of stochastic dynamic programming, or reinforcement learning: determining whether certain algorithms are guaranteed to converge to an optimal policy (control) for a Markov decision process with a finite number of states and actions. These algorithms, which include Sarsa(λ) and Monte Carlo ES, are (1) model-free, operating directly on experience obtained from the environment rather than performing system identification; (2) on-policy, learning only about the policy currently being executed; and (3) multi-step, using sequences of experiences for their updates. The convergence of several other reinforcement learning algorithms, including Q-learning, TD(λ), and actor-critic methods, has been established using stochastic approximation techniques. However, standard stochastic approximation theory is not directly applicable to the model-free, on-policy, multi-step algorithms mentioned above, because their mean iterate dynamics contain jump discontinuities under greedy or rank-based policy selection.

Simplifying the problem, we analyze model-free policy iteration algorithms, including incremental and asynchronous variants, which include Monte Carlo ES as a special case. Using the theory of dynamic programming and Markov decision processes, we establish the mean iterate dynamics of these algorithms under greedy policy selection, characterize their discontinuities, and prove their convergence in the absence of evaluation noise. For the more realistic case in which noise is present, we propose a novel modification of the algorithms that introduces an arbitrarily small "margin for error," eliminating the possibility of infinitesimal noise causing erroneous policy changes. Using stochastic approximation, probability, and martingale theory, we establish bounds on the deviation caused by evaluation noise or unbalanced component updates, and provide a novel criterion for convergence under diminishing stepsizes. Two major convergence theorems follow: a "convergence in distribution" result for modified fixed-stepsize, synchronous incremental policy iteration, and the almost-sure (a.s.) convergence of modified Monte Carlo ES under diminishing-stepsize, asynchronous updates. We generalize these analyses and convergence results to the case of stochastic, rank-based policy selection. Finally, we point out characteristics of the mean iterate dynamics of Sarsa(λ) that prevent us from immediately applying these results to demonstrate its convergence.
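The following is a minimal, hypothetical Python sketch (not taken from the thesis) of the kind of algorithm discussed above: tabular, model-free policy iteration in the style of Monte Carlo ES with exploring starts, in which the greedy improvement step switches to a new action only when its estimated value exceeds the current action's value by a small margin, so that infinitesimal evaluation noise cannot trigger spurious policy changes. The environment interface (env.random_state(), env.step(state, action) returning (next_state, reward, done)) and all names are assumptions for illustration only.

```python
import random
from collections import defaultdict

def monte_carlo_es_sketch(env, actions, num_episodes=10000, epsilon_margin=1e-3):
    """Illustrative tabular Monte Carlo ES-style policy iteration.

    Q-values are estimated from complete episode returns (undiscounted for
    brevity); the policy is improved greedily, but only when a candidate
    action's value exceeds the current action's value by more than
    `epsilon_margin` (the "margin for error").
    """
    Q = defaultdict(float)       # state-action value estimates
    counts = defaultdict(int)    # visit counts for incremental averaging
    policy = {}                  # greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring starts: begin each episode from a random state-action pair.
        state = env.random_state()
        action = random.choice(actions)
        episode = []
        done = False
        while not done:
            next_state, reward, done = env.step(state, action)
            episode.append((state, action, reward))
            state = next_state
            action = policy.get(state, random.choice(actions))

        # Policy evaluation: every-visit Monte Carlo return.
        G = 0.0
        for s, a, r in reversed(episode):
            G += r
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]

            # Policy improvement with a margin for error: only switch actions
            # if the estimated improvement exceeds epsilon_margin.
            current = policy.get(s, a)
            best = max(actions, key=lambda b: Q[(s, b)])
            if Q[(s, best)] > Q[(s, current)] + epsilon_margin:
                policy[s] = best
            else:
                policy[s] = current

    return policy, Q
```

The margin plays the role described in the abstract: without it, arbitrarily small noise in the Q-value estimates could flip the greedy argmax and cause erroneous policy changes; with it, only improvements larger than the margin are acted upon.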
Keywords/Search Tags: Convergence, Reinforcement learning, Stochastic, Policy, Algorithms, Mean iterate dynamics, Theory