
Performance Potential-based NDP Optimization Approaches And Application Research For SMDP

Posted on: 2007-01-23  Degree: Master  Type: Thesis
Country: China  Candidate: D Chen  Full Text: PDF
GTID: 2178360182986281  Subject: Computer application technology
Abstract/Summary:
Discrete event dynamic systems (DEDSs) are man-made systems that arise widely in the real world, and the semi-Markov decision process (SMDP) is one of the major models for such systems. To meet the optimal-control requirements of DEDSs with large state spaces, we focus on applying neuro-dynamic programming (NDP), based on reinforcement learning (RL), to SMDP performance optimization.

The performance potential plays an important role in the performance analysis of SMDPs. From the performance potential theory and the Bellman optimality equation, it is straightforward to establish an optimality equation, which we call the performance potential-based Bellman optimality equation, for both the average-cost and the discounted-cost criteria. The equation can be solved by classical methods such as value iteration and policy iteration, with the control information stored in a look-up table. However, modern DEDSs are generally complex systems: solving the equation exactly requires storing a large amount of information, and memory may even overflow when large matrix inversions are involved. The resulting "curse of dimensionality" can make the problem intractable. Simulation-based methods offer an alternative, and NDP methods based on RL are effective here: they use function approximation or a neural architecture to represent the performance function or the policy, and therefore need only a comparatively small amount of memory for the parameters.

NDP comprises three classes of methods: critic, actor, and actor-critic. This thesis centers on the actor and actor-critic methods. In both, the SMDP is first transformed into an equivalent MDP and then into a uniformized equivalent chain, so that performance optimization can be carried out with the optimization theory of the uniformized chain. In the actor method, a neural network approximates the policy. Based on a single sample path of the uniformized chain, we first learn the performance potential, whose estimate serves as the approximate policy evaluation of the actor algorithm; based on the potential estimates, we then perform approximate policy updating, improving the network parameters and hence the policy. For network training we provide two parameter-improvement schemes: a steepest-descent method and a sample-training method. In the actor-critic method, two neural networks represent the potential and the policy separately. Policy evaluation and policy updating are similar to the actor method, except that the parameterized potential learning is carried out by steepest descent along a single simulated sample path. In both methods we present a unified TD(λ) learning formula covering the discounted and average criteria, and a numerical example demonstrates the effectiveness of the corresponding algorithms.
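For reference, one common form of the ingredients just described, written for the average-cost criterion on the uniformized chain, is sketched below; the notation is generic and is not necessarily that of the thesis.

Potential-based Bellman optimality equation (average cost):
\eta^* + g^*(i) = \min_{a \in A(i)} \Big[ f(i,a) + \sum_{j} p(i,j;a)\, g^*(j) \Big]

TD(λ) learning of a parameterized potential g_w along a single sample path:
\delta_t = f(i_t,a_t) - \hat\eta_t + g_w(i_{t+1}) - g_w(i_t), \qquad
z_t = \lambda z_{t-1} + \nabla_w g_w(i_t), \qquad
w_{t+1} = w_t + \alpha_t \delta_t z_t,

with \hat\eta_t maintained as a running average of the observed one-step costs; for the discounted criterion the TD error takes the form \delta_t = f(i_t,a_t) + \beta\, g_w(i_{t+1}) - g_w(i_t) with discount factor \beta, and the average-cost term drops out.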
The call admission control (CAC) problem is very common in telecommunication networks, and CAC for a single link can be formulated as a continuous-time Markov decision problem. As before, the large state space of a CAC system can lead to the curse of dimensionality. We therefore apply NDP methods to CAC and study all three methods (critic, actor, and actor-critic) in its optimal control. In all three, the performance potential learning or parameterized learning is similar to that in Chapters 3 and 4; in the actor and actor-critic methods, the policy approximated by the network is a randomized policy. Simulation results show the effectiveness of the NDP methods for CAC.
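To make the CAC setting concrete, the sketch below simulates a hypothetical single-link system with one call class and learns the performance potential by TD(λ) along a single sample path of the uniformized chain, under a fixed randomized admission policy. It is a minimal illustration only: the model, the parameter values, and the linear feature approximator (standing in for the thesis's neural networks) are assumptions of this sketch, not the thesis's actual experiments.

import numpy as np

# Hypothetical single-link CAC model, one call class, link capacity C.
# State i = number of calls in progress; action a in {0: reject, 1: accept}
# applies to the next arrival. The continuous-time chain is uniformized with
# rate V, and the potential is learned by TD(lambda) along one sample path.
C, lam, mu = 10, 4.0, 1.0           # capacity, arrival rate, per-call departure rate
V = lam + C * mu                    # uniformization constant
rng = np.random.default_rng(0)

def features(i):
    # simple polynomial features for the linear potential approximator
    x = i / C
    return np.array([1.0, x, x * x])

def step(i, a):
    # one transition of the uniformized chain; returns (next state, reward)
    p_arr, p_dep = lam / V, i * mu / V
    u = rng.random()
    if u < p_arr:                   # an arrival occurs
        if a == 1 and i < C:
            return i + 1, 1.0       # call admitted: unit revenue
        return i, 0.0               # call rejected or link full
    if u < p_arr + p_dep:           # a departure occurs
        return i - 1, 0.0
    return i, 0.0                   # fictitious self-transition

def admit_prob(i, theta):
    # randomized admission policy: probability of accepting at state i
    if i >= C:
        return 0.0
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * i / C)))

# TD(lambda) learning of the potential g_w(i) = w . features(i), average-reward case
w, z = np.zeros(3), np.zeros(3)
eta = 0.0                           # running estimate of the average reward
alpha, beta, lam_td = 0.05, 0.01, 0.7
theta = np.array([1.0, -2.0])       # fixed randomized policy for this sketch
i = 0
for t in range(50_000):
    a = 1 if rng.random() < admit_prob(i, theta) else 0
    j, r = step(i, a)
    delta = r - eta + w @ features(j) - w @ features(i)   # TD error with potential
    z = lam_td * z + features(i)                          # eligibility trace
    w += alpha * delta * z                                # potential parameters
    eta += beta * (r - eta)                               # average-reward estimate
    i = j

print("estimated average reward per uniformized step:", round(eta, 3))

A full actor or actor-critic scheme would additionally update the policy parameters theta from the learned potential estimates, as described above.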
Keywords/Search Tags:semi-Markov decision process, performance potential, uniformized chain, reinforcement learning, neuro-dynamic programming