
NDP Optimization For Large-scale Markov Systems Based On Performance Potentials-learning

Posted on: 2006-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: J B Yuan
Full Text: PDF
GTID: 2168360152490391
Subject: Computer software and theory
Abstract/Summary:
Many sequential decision problems, such as flexible manufacturing systems, traffic command systems, and queueing systems, can be modeled as Markov decision processes (MDPs). Driven by the needs of these applications, the optimization of MDPs has become a major research focus in the control field. As a class of complex discrete event dynamic systems (DEDS), MDPs typically suffer from the curse of dimensionality and the curse of modeling, so their management and control problems cannot be solved by conventional methods. Performance potential theory provides a unified framework for MDP optimization. This thesis therefore studies the optimization of a class of MDPs based on performance potentials.

Because traditional theoretical methods such as policy iteration and value iteration usually cannot be applied to large-scale systems, we rely on simulation. By simulating a single sample path and approximating the performance potentials with neural networks trained by reinforcement learning (RL), we obtain optimization methods for such systems. Two RL methods, Monte-Carlo estimation and temporal-difference (TD) learning, are considered in this thesis, and we derive neuro-dynamic programming (NDP) optimization algorithms for MDPs based on each. In particular, policy iteration algorithms based on the simulation of a single sample path and a neuro-policy iteration algorithm are presented, and an error bound on the resulting performance is derived in terms of the approximation error and the policy improvement error incurred at each iteration step. Within the critic model of the NDP methodology, we discuss parameterized TD(0) learning rules and parameter-update formulas for both average-criterion and discounted-criterion problems, and derive NDP optimization algorithms based on TD(0) learning. Then, by simulating a single sample path, we introduce a unified TD(0) learning formula for the potentials and develop a unified NDP optimization approach based on parameterized TD(0) learning.

For semi-Markov decision processes (SMDPs), which arise widely in practice, we define an alpha-uniformized Markov chain via the concept of an equivalent infinitesimal generator. The optimization of the SMDP can then be transformed into the optimization of this chain, using the relations between their performance measures and their Markov performance potentials. The optimization of SMDPs under both average and discounted criteria is then discussed.

Finally, a numerical example of an SMDP is provided to demonstrate the optimization methods proposed in this thesis. The results are applicable to general Markov and semi-Markov systems.
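The Monte-Carlo approach mentioned above estimates each potential from regenerative segments of a single sample path. The following is a minimal sketch under standard assumptions: a fixed reference state whose potential is pinned to zero, first-visit accumulation per regenerative cycle, and a small randomly generated chain standing in for a large-scale system. The chain, step counts, and variable names are illustrative, not taken from the thesis.

```python
# Minimal sketch: Monte-Carlo estimation of performance potentials from a
# single sample path, using the regenerative form
#   g(i) ~ E[ sum_{t < T} (r(X_t) - eta) | X_0 = i ],
# where T is the first hitting time of a fixed reference state.
import numpy as np

rng = np.random.default_rng(1)
N = 5
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)  # toy transition matrix
r = rng.random(N)                                          # one-step rewards

# For the toy chain we compute the average reward eta exactly; on a real
# system it would itself be estimated from the sample path.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()
eta = pi @ r

ref = 0                                  # reference state: g(ref) fixed to 0
sums = np.zeros(N); visits = np.zeros(N)
i, acc = ref, {}                         # acc: open accumulators, one per state

for t in range(500_000):
    for s in list(acc):                  # every open accumulator collects r - eta
        acc[s] += r[i] - eta
    if i not in acc:                     # first visit in this cycle: open one
        acc[i] = r[i] - eta
    j = rng.choice(N, p=P[i])            # simulate one transition on the path
    if j == ref:                         # regeneration: close all accumulators
        for s, v in acc.items():
            sums[s] += v; visits[s] += 1
        acc = {}
    i = j

g = np.where(visits > 0, sums / np.maximum(visits, 1), 0.0)
print("Monte-Carlo potentials:", g - g[ref])
```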
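The parameterized TD(0) rule for the average criterion can be sketched as follows, assuming a linear parameterization g_theta(i) = theta . phi(i); with one-hot features this reduces to the tabular case, and replacing phi with a neural network recovers the critic model discussed above. The step sizes and the toy chain are illustrative assumptions, not the thesis' settings.

```python
# Minimal sketch: parameterized TD(0) learning of performance potentials for
# the average-reward criterion.  In expectation the update drives
#   delta = r(i) - eta + g(j) - g(i)
# to zero, i.e. g solves the Poisson equation (I - P) g = r - eta * e.
import numpy as np

rng = np.random.default_rng(0)
N = 5
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)  # toy transition matrix
r = rng.random(N)                                          # one-step rewards

def phi(i, n=N):
    """Feature vector; one-hot here, so the tabular case is a special instance."""
    v = np.zeros(n); v[i] = 1.0
    return v

theta = np.zeros(N)                     # parameters of g_theta
eta = 0.0                               # running estimate of the average reward
alpha, beta = 0.05, 0.01                # step sizes for theta and eta

i = 0
for t in range(200_000):
    j = rng.choice(N, p=P[i])           # simulate one transition on the path
    delta = r[i] - eta + theta @ phi(j) - theta @ phi(i)   # temporal difference
    theta += alpha * delta * phi(i)     # gradient-style parameter update
    eta += beta * (r[i] - eta)          # track the average reward
    i = j

print("estimated average reward:", eta)
print("estimated potentials    :", theta - theta.mean())  # defined up to a constant
```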
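The alpha-uniformization step can be written compactly. The notation below (A for the equivalent infinitesimal generator, f for the reward function, eta for the average reward, g for the potentials) follows common performance-potential conventions and is a reconstruction of the standard relations, not a quotation of the thesis' formulas.

```latex
% alpha-uniformized chain and its Poisson equation (reconstructed notation)
\[
  P_\alpha = I + \frac{A}{\alpha},
  \qquad \alpha \ge \max_i \lvert a(i,i)\rvert ,
\]
% The continuous-time Poisson equation for the potentials is -A g = f - \eta e.
% Substituting A = \alpha (P_\alpha - I) gives the discrete-time form
\[
  (I - P_\alpha)\, g = \frac{1}{\alpha}\,\bigl(f - \eta e\bigr),
\]
% i.e. g also solves the Poisson equation of the uniformized chain with reward
% f/\alpha and average reward \eta/\alpha, which is what allows the SMDP
% optimization to be carried out on the alpha-uniformized Markov chain.
```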
Keywords/Search Tags:Markov decision processes, Performance potentials, Neuro-dynamic programming, Reinforcement learning