
Research On Non-parametric Function Approximation Methods In Continuous Spaces

Posted on: 2015-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhu
Full Text: PDF
GTID: 2268330428998560
Subject: Computer software and theory
Abstract/Summary:
Reinforcement learning is a trial-and-error learning method that can solve model-free problems: in the absence of any prior knowledge, the agent learns from its own experience gathered through constant interaction with the environment. This paper studies problems with continuous state and action spaces. The traditional approach is to discretize the state or action space, but guaranteeing a given precision inevitably produces a very large state or action space, which leads to the "curse of dimensionality" problem. Based on the Actor-Critic architecture, this paper proposes three Actor-Critic algorithms in which the critic uses non-parametric function approximation to cope with the curse of dimensionality in continuous state spaces, while the actor uses the policy gradient to select actions.

(1) To address the low sample efficiency of existing non-parametric methods, we propose a kernel-based recursive least squares AC algorithm. The actor runs a kernel-based policy gradient algorithm that approximates the true Q-value with a kernel function when estimating the policy gradient. The critic runs an ALD-based KRLSTD-Q algorithm, which makes full use of sample information while eliminating explicit matrix inversion (a recursive update of this kind is sketched after the abstract). The effectiveness of the algorithm is verified by simulation experiments on Mountain Car.

(2) In view of the effectiveness of the Gaussian kernel function, we propose a least squares support vector regression (LSSVR) AC algorithm. The actor uses the policy gradient algorithm; to make the method feasible, we also propose a scheme that keeps the sample sets used for policy evaluation and policy improvement compatible with each other. A data dictionary is obtained from the policy-evaluation sample set by ALD sparsification, the regression model of the V-value function is computed with LSSVR on that dictionary (both steps are sketched after the abstract), and the policy is improved on the policy-improvement sample set.

(3) Because the two algorithms above are offline and therefore not real-time, we propose an online GPTD-AC algorithm. The actor runs an online policy gradient algorithm that adapts to the growth of the kernel dictionary, which makes it suitable for non-parametric online learning (one such actor is sketched after the abstract); the critic uses the online GPTD algorithm to evaluate, in a timely fashion, the actions generated by the actor.
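Several of the components above rely on approximate linear dependence (ALD) sparsification to build the kernel dictionary. The abstract gives no code, so the following is only a minimal Python sketch of the standard ALD test (in the style of Engel et al.); the Gaussian kernel, the threshold nu, and all names are our own illustrative assumptions, not the thesis's implementation.

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        # Assumed RBF kernel; the thesis emphasizes the Gaussian kernel.
        return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2))

    def ald_dictionary(samples, nu=0.1, kernel=gaussian_kernel):
        """Return the subset of `samples` kept by the ALD test with threshold nu."""
        dictionary = [samples[0]]
        K_inv = np.array([[1.0 / kernel(samples[0], samples[0])]])
        for x in samples[1:]:
            k = np.array([kernel(d, x) for d in dictionary])  # kernel vector vs. dictionary
            c = K_inv @ k                                     # least-squares coefficients
            delta = kernel(x, x) - k @ c                      # ALD residual
            if delta > nu:
                # x is not approximately linearly dependent on the dictionary:
                # admit it and grow the inverse kernel matrix by block inversion.
                n = len(dictionary)
                K_inv_new = np.zeros((n + 1, n + 1))
                K_inv_new[:n, :n] = K_inv + np.outer(c, c) / delta
                K_inv_new[:n, n] = -c / delta
                K_inv_new[n, :n] = -c / delta
                K_inv_new[n, n] = 1.0 / delta
                K_inv = K_inv_new
                dictionary.append(x)
        return dictionary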
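The KRLSTD-Q critic in (1) is described only as recursive least squares that avoids matrix inversion. A generic recursive least-squares TD(0) update over kernel features anchored at the ALD dictionary might look like the sketch below; the discount gamma, the initialization of P, and the concatenated state-action feature map are assumptions, not the thesis's exact derivation.

    import numpy as np

    class KernelRLSTDQ:
        """Recursive least-squares TD(0) for Q-values over kernel features.

        Features are kernel evaluations of the concatenated (state, action)
        vector against a fixed ALD dictionary; a Sherman-Morrison-style
        recursion maintains the inverse matrix P, so no explicit matrix
        inversion is ever performed.
        """

        def __init__(self, dictionary, kernel, gamma=0.95, p_init=10.0):
            self.dictionary = dictionary   # list of concatenated (s, a) vectors
            self.kernel = kernel
            self.gamma = gamma
            self.theta = np.zeros(len(dictionary))
            self.P = p_init * np.eye(len(dictionary))

        def features(self, s, a):
            z = np.concatenate([np.atleast_1d(s), np.atleast_1d(a)])
            return np.array([self.kernel(d, z) for d in self.dictionary])

        def update(self, s, a, r, s_next, a_next):
            phi = self.features(s, a)
            d = phi - self.gamma * self.features(s_next, a_next)  # TD feature difference
            g = self.P @ phi / (1.0 + d @ (self.P @ phi))         # gain vector
            self.theta += g * (r - d @ self.theta)                # TD-error correction
            self.P -= np.outer(g, d @ self.P)                     # rank-one update of P

        def q_value(self, s, a):
            return self.features(s, a) @ self.theta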
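For (2), fitting the LSSVR model of the V-value function on the dictionary reduces to solving a single linear system. The sketch below uses the standard LSSVR dual formulation with our own choice of regularization parameter C; the thesis's hyperparameters and kernel settings are not reproduced here.

    import numpy as np

    def lssvr_fit(X, y, kernel, C=10.0):
        """Fit LSSVR: solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
        n = len(X)
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = K + np.eye(n) / C
        rhs = np.concatenate(([0.0], np.asarray(y)))
        sol = np.linalg.solve(A, rhs)
        return sol[0], sol[1:]          # bias b, dual weights alpha

    def lssvr_predict(x, X, b, alpha, kernel):
        # V-hat(x) = sum_i alpha_i * k(x_i, x) + b
        return b + sum(a * kernel(xi, x) for a, xi in zip(alpha, X))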
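The online actor in (3) is characterized only as a policy-gradient method that adapts to the growth of the kernel dictionary. One common realization, sketched here purely as an assumption rather than the thesis's algorithm, is a Gaussian policy whose mean is a kernel expansion over the same dictionary: whenever the critic's sparsification admits a new kernel, the actor appends a zero weight and continues learning online.

    import numpy as np

    class KernelGaussianActor:
        """Gaussian policy with mean linear in kernel features of the state."""

        def __init__(self, kernel, sigma=0.5, lr=0.01):
            self.kernel = kernel
            self.sigma = sigma
            self.lr = lr
            self.dictionary = []   # state dictionary, grown alongside the critic's
            self.w = np.zeros(0)   # one weight per dictionary element

        def add_kernel(self, state):
            # Called when a new dictionary point is admitted: the policy
            # parameterization grows without disturbing existing weights.
            self.dictionary.append(state)
            self.w = np.append(self.w, 0.0)

        def features(self, s):
            return np.array([self.kernel(d, s) for d in self.dictionary])

        def act(self, s, rng=np.random):
            return rng.normal(self.features(s) @ self.w, self.sigma)

        def update(self, s, a, critic_value):
            # Policy-gradient step: grad log pi(a|s) = (a - mu) / sigma^2 * phi(s),
            # scaled by the critic's evaluation of the chosen action.
            phi = self.features(s)
            mu = phi @ self.w
            self.w += self.lr * critic_value * (a - mu) / self.sigma ** 2 * phi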
Keywords/Search Tags:Reinforcement Learning, Non-parametric Function Approximation, Actor-Critic, Policy Gradient, Least Squares