
NDP Optimization For Large-scale Markov Systems Based On Performance Potentials-learning

Posted on: 2006-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: J B Yuan
Full Text: PDF
GTID: 2168360152490391
Subject: Computer software and theory
Abstract/Summary:
Many sequential decision problems, such as flexible manufacturing systems, traffic command systems, and queueing systems, can be modeled as Markov decision processes (MDPs). Driven by the needs of these applications, the optimization of MDPs has become a major research focus in the control field. As a class of complex discrete event dynamic systems (DEDS), MDPs typically suffer from the curse of dimensionality and the curse of modeling, so their management and control problems cannot be solved by conventional methods. Performance potential theory provides a unified framework for MDP optimization. This thesis therefore studies the optimization of a class of MDPs based on performance potentials.

Because traditional theoretical methods such as policy iteration and value iteration usually cannot be applied to large-scale systems, we rely on simulation. By simulating a single sample path and approximating the performance potentials with neural networks trained by reinforcement learning (RL), we obtain optimization methods for such systems. Two RL methods, Monte-Carlo estimation and temporal-difference (TD) learning, are considered in this thesis, and we derive neuro-dynamic programming (NDP) optimization algorithms for MDPs based on each. In particular, policy iteration algorithms based on the simulation of a single sample path and a neuro-policy iteration algorithm are presented, and an error bound on the resulting performance is derived in terms of the approximation error and the policy improvement error incurred at each iteration step. Within the critic model of the NDP methodology, we discuss parameterized TD(0) learning rules and parameter-update formulas for both average-criterion and discounted-criterion problems, and derive NDP optimization algorithms based on TD(0) learning. Then, by simulating a single sample path, we introduce a unified TD(0) learning formula for the potentials and develop a unified NDP optimization approach based on parameterized TD(0) learning.

For semi-Markov decision processes (SMDPs), which arise widely in practice, we define an alpha-uniformized Markov chain via the concept of an equivalent infinitesimal generator. The optimization of the SMDP can then be transformed into the optimization of this chain, using the relations between their performance measures and their Markov performance potentials. The optimization of SMDPs under both average and discounted criteria is then discussed.

Finally, a numerical example of an SMDP is provided to demonstrate the optimization methods proposed in this thesis. The results are applicable to general Markov and semi-Markov systems.
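The Monte-Carlo approach mentioned above estimates each potential from regenerative segments of a single sample path. The following is a minimal sketch under standard assumptions: a fixed reference state whose potential is pinned to zero, first-visit accumulation per regenerative cycle, and a small randomly generated chain standing in for a large-scale system. The chain, step counts, and variable names are illustrative, not taken from the thesis.

```python
# Minimal sketch: Monte-Carlo estimation of performance potentials from a
# single sample path, using the regenerative form
#   g(i) ~ E[ sum_{t < T} (r(X_t) - eta) | X_0 = i ],
# where T is the first hitting time of a fixed reference state.
import numpy as np

rng = np.random.default_rng(1)
N = 5
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)  # toy transition matrix
r = rng.random(N)                                          # one-step rewards

# For the toy chain we compute the average reward eta exactly; on a real
# system it would itself be estimated from the sample path.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()
eta = pi @ r

ref = 0                                  # reference state: g(ref) fixed to 0
sums = np.zeros(N); visits = np.zeros(N)
i, acc = ref, {}                         # acc: open accumulators, one per state

for t in range(500_000):
    for s in list(acc):                  # every open accumulator collects r - eta
        acc[s] += r[i] - eta
    if i not in acc:                     # first visit in this cycle: open one
        acc[i] = r[i] - eta
    j = rng.choice(N, p=P[i])            # simulate one transition on the path
    if j == ref:                         # regeneration: close all accumulators
        for s, v in acc.items():
            sums[s] += v; visits[s] += 1
        acc = {}
    i = j

g = np.where(visits > 0, sums / np.maximum(visits, 1), 0.0)
print("Monte-Carlo potentials:", g - g[ref])
```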
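The parameterized TD(0) rule for the average criterion can be sketched as follows, assuming a linear parameterization g_theta(i) = theta . phi(i); with one-hot features this reduces to the tabular case, and replacing phi with a neural network recovers the critic model discussed above. The step sizes and the toy chain are illustrative assumptions, not the thesis' settings.

```python
# Minimal sketch: parameterized TD(0) learning of performance potentials for
# the average-reward criterion.  In expectation the update drives
#   delta = r(i) - eta + g(j) - g(i)
# to zero, i.e. g solves the Poisson equation (I - P) g = r - eta * e.
import numpy as np

rng = np.random.default_rng(0)
N = 5
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)  # toy transition matrix
r = rng.random(N)                                          # one-step rewards

def phi(i, n=N):
    """Feature vector; one-hot here, so the tabular case is a special instance."""
    v = np.zeros(n); v[i] = 1.0
    return v

theta = np.zeros(N)                     # parameters of g_theta
eta = 0.0                               # running estimate of the average reward
alpha, beta = 0.05, 0.01                # step sizes for theta and eta

i = 0
for t in range(200_000):
    j = rng.choice(N, p=P[i])           # simulate one transition on the path
    delta = r[i] - eta + theta @ phi(j) - theta @ phi(i)   # temporal difference
    theta += alpha * delta * phi(i)     # gradient-style parameter update
    eta += beta * (r[i] - eta)          # track the average reward
    i = j

print("estimated average reward:", eta)
print("estimated potentials    :", theta - theta.mean())  # defined up to a constant
```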
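The alpha-uniformization step can be written compactly. The notation below (A for the equivalent infinitesimal generator, f for the reward function, eta for the average reward, g for the potentials) follows common performance-potential conventions and is a reconstruction of the standard relations, not a quotation of the thesis' formulas.

```latex
% alpha-uniformized chain and its Poisson equation (reconstructed notation)
\[
  P_\alpha = I + \frac{A}{\alpha},
  \qquad \alpha \ge \max_i \lvert a(i,i)\rvert ,
\]
% The continuous-time Poisson equation for the potentials is -A g = f - \eta e.
% Substituting A = \alpha (P_\alpha - I) gives the discrete-time form
\[
  (I - P_\alpha)\, g = \frac{1}{\alpha}\,\bigl(f - \eta e\bigr),
\]
% i.e. g also solves the Poisson equation of the uniformized chain with reward
% f/\alpha and average reward \eta/\alpha, which is what allows the SMDP
% optimization to be carried out on the alpha-uniformized Markov chain.
```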
Keywords/Search Tags:Markov decision processes, Performance potentials, Neuro-dynamic programming, Reinforcement learning