
Study Of Reinforcement Learning Algorithms Based On Value Function Approximation

Posted on: 2014-09-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X G Chen
Full Text: PDF
GTID: 1268330425468267
Subject: Computer application technology

Abstract/Summary:
In recent years, more and more machine learning researchers have turned to reinforcement learning. For problems with both a small-scale state space and a small-scale action space, the classic table-based reinforcement learning algorithms have been proved to converge and perform well in experiments.

In practice, however, reinforcement learning problems usually come with large-scale or continuous state spaces, and sometimes with continuous action spaces as well, e.g., steering control in automatic driving. This brings the "curse of dimensionality", which challenges the classic table-based algorithms in both memory consumption and learning efficiency. A common solution is to combine classic reinforcement learning algorithms with function approximation methods in order to enhance the abstraction and generalization ability over the state and action spaces.

With respect to function approximation, the main work and contributions of this thesis are as follows:

(1) A short introduction to the reinforcement learning model is given, followed by surveys of reinforcement learning algorithms based on linear function approximation and of kernel-based reinforcement learning algorithms.

(2) The Temporal Difference (TD) learning family tries to learn a least-squares solution for an approximate Linear Value Function (LVF). However, owing to the limited representational ability of the features in the LVF, the prediction error of the learned LVF is only bounded by the residual between the optimal value function and its projection, where the projection operator is Π = Φ(Φ^T D Φ)^{-1} Φ^T D, with Φ the feature matrix and D the diagonal matrix of the stationary state distribution. We find that this error can be very large if the feature function Φ is not well designed. To deal with this problem, Temporal Difference learning with a Piecewise Linear Value Function (PLVF-TD) is proposed to further decrease the error bound. PLVF-TD has two steps: (i) building the piecewise linear basis for problems of different dimensions; (ii) learning the parameters via temporal difference learning, whose complexity is O(n). The error bound is proved to decrease to zero as the size of the piecewise basis goes to infinity; a sketch of such an update is given below.
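To make step (ii) concrete, here is a minimal sketch of a TD(0) update over a one-dimensional piecewise linear ("hat function") basis. It is not the thesis's implementation: the knot-based basis construction, the function names, and the step-size and discount values are illustrative assumptions.

```python
import numpy as np

def piecewise_linear_basis(s, knots):
    """Hat (tent) functions over sorted 1-D knots: each basis function is 1 at its
    own knot, decays linearly to 0 at the neighbouring knots, and is 0 elsewhere."""
    phi = np.zeros(len(knots))
    for i, k in enumerate(knots):
        left = knots[i - 1] if i > 0 else k
        right = knots[i + 1] if i < len(knots) - 1 else k
        if left < s <= k:
            phi[i] = (s - left) / (k - left)
        elif k <= s < right:
            phi[i] = (right - s) / (right - k)
        elif s == k:          # degenerate single-knot case
            phi[i] = 1.0
    return phi

def td0_update(theta, s, r, s_next, knots, alpha=0.1, gamma=0.95):
    """One TD(0) step for the linear-in-parameters value V(s) = phi(s)^T theta.
    Both the basis evaluation and the update are O(n) in the number of knots."""
    phi_s = piecewise_linear_basis(s, knots)
    phi_next = piecewise_linear_basis(s_next, knots)
    delta = r + gamma * phi_next @ theta - phi_s @ theta   # TD error
    return theta + alpha * delta * phi_s
```

Refining the knot grid enlarges the representable function class, which is the mechanism behind the error bound shrinking as the basis grows.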
(3) In contrast to linearly approximated reinforcement learning, kernel-based reinforcement learning has strong representational ability thanks to the Representer Theorem. However, typical kernel-based reinforcement learning algorithms cannot meet the accuracy and complexity requirements of online learning at the same time.

To deal with this problem, an algorithm named Online Selective Kernel-based Temporal Difference (OSKTD) learning is proposed. OSKTD consists of two online procedures: online sparsification, and parameter updating for the selective kernel-based value function. A new sparsification method, a kernel distance-based online sparsification method, is proposed based on selective ensemble learning; it is computationally less complex than other sparsification methods. With this method, the sparsified dictionary of samples is constructed online by checking whether each new sample needs to be added to the dictionary (a sketch of such a check appears after this summary). In addition, based on local validity, a selective kernel-based value function is proposed to select the best samples from the dictionary for the value function approximator. The parameters of the selective kernel-based value function are updated iteratively by temporal difference learning combined with gradient descent. The complexity of the online sparsification procedure in OSKTD is O(n).

(4) Real-world problems often require learning algorithms to handle both continuous state and continuous action spaces in order to control accurately, so reinforcement learning with continuous action spaces has become a hot research topic.

To deal with this problem, Actor-Critic methods are combined with kernel methods: Actor-Critic methods are good at handling continuous action spaces, while kernel methods are good at handling continuous state spaces. The resulting algorithm, Kernel-based Continuous-action Actor-Critic Learning (KCACL), has the actor update the probability of each action according to a reward-inaction scheme, while the critic updates the state value function with OSKTD learning. Empirical results demonstrate the effectiveness of all the proposed algorithms.
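As an illustration of the dictionary check described in contribution (3), below is a minimal sketch of a kernel distance-based online sparsification test: a sample is added only if it is far enough, in feature-space distance, from every sample already stored. The Gaussian kernel, the threshold mu, and the exact distance rule are assumptions for illustration, not necessarily the rule used by OSKTD.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; any positive-definite kernel could be used instead."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def maybe_add_to_dictionary(dictionary, s, mu=0.1, sigma=1.0):
    """Illustrative kernel distance-based online sparsification: add state s to the
    dictionary only if its feature-space distance to the nearest stored sample
    exceeds the threshold mu. One pass over the dictionary, so the check is O(n)."""
    if not dictionary:
        dictionary.append(s)
        return True
    # ||phi(s) - phi(d)||^2 = k(s,s) - 2 k(s,d) + k(d,d) = 2 - 2 k(s,d) for a Gaussian kernel
    nearest_sq_dist = min(2.0 - 2.0 * gaussian_kernel(s, d, sigma) for d in dictionary)
    if nearest_sq_dist > mu:
        dictionary.append(s)
        return True
    return False
```

In an online setting this check runs once per observed state, so the dictionary, and with it the cost of evaluating the kernel-based value function, stays small.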
Keywords/Search Tags: Reinforcement Learning, Function Approximation, Kernel Methods, Piecewise Linear Basis, Selective Ensemble Learning, Local Validity, Actor-Critic Learning