
Research On Reinforcement Learning In Continuous Spaces

Posted on: 2017-12-11    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Y Zhang    Full Text: PDF
GTID: 1318330512488092    Subject: Computer software and theory
Abstract/Summary:
As an important methodology for solving sequential decision problems, reinforcement learning has received growing research interest because of its mechanism of autonomous learning. Although it has made considerable progress in recent years, it often suffers from challenges such as the curse of dimensionality and inefficient learning when dealing with practical problems that have continuous state and action spaces. Therefore, this dissertation studies reinforcement learning in continuous spaces. The main work and contributions are summarized as follows:

(1) Many temporal difference (TD) algorithms based on linear and local approximation cannot represent the continuous state space adaptively or solve for the continuous policy accurately. To overcome both problems, an incremental TD learning framework with nearest neighbors is proposed, and several schemes for its key components are presented. The framework is based on the observation that the value functions and policies of neighboring states are often similar. It selects some observed states to construct a sparse dictionary online, and uses locally weighted learning to approximate the value function and the continuous policy. Besides the discrete policy, it can also learn the continuous policy. Theoretical analysis and simulation results show that the framework is simple, efficient, open, and easy to understand, and that it has a reliable convergence guarantee.

(2) Kernel least-squares TD algorithms cannot sparsify the state space online and do not consider regularization. To overcome both problems, three online sparse kernel recursive least-squares TD (RLSTD) algorithms are proposed, called OSKRLSTD-L2, OSKRLSTD-L1, and OSMKRLSTD-L2. All of these algorithms use the optimization of the Bellman projection operator, online sparsification, regularization, RLS, and the sliding-window technique, which simplifies the derivation, represents the state space automatically, avoids over-fitting, alleviates the influence of noise, and reduces both the computational and the memory cost. Moreover, the OSKRLSTD-L1 algorithm embeds a fixed-point sub-iteration and an online pruning sub-algorithm, which allows L1 regularization to be implemented online and yields a sparser representation of the state space. The OSMKRLSTD-L2 algorithm also introduces the multi-kernel least-squares technique for the first time, which further improves its approximation ability.
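Contributions (1) and (2) both approximate the value function from a sparse dictionary of previously observed states. The minimal sketch below illustrates that shared idea, a locally weighted estimate over the k nearest dictionary states with Gaussian kernel weights; all names and the specific weighting scheme are assumptions for illustration, not the dissertation's exact algorithms.

```python
import numpy as np

def locally_weighted_value(state, dictionary, values, k=5, bandwidth=0.5):
    """Estimate V(state) as a kernel-weighted average over the k nearest
    dictionary states (a simplified sketch of locally weighted learning)."""
    dictionary = np.asarray(dictionary, dtype=float)  # (n, d) stored states
    values = np.asarray(values, dtype=float)          # (n,) their value estimates
    dists = np.linalg.norm(dictionary - state, axis=1)
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    weights = np.exp(-(dists[nearest] / bandwidth) ** 2)  # Gaussian kernel weights
    return np.dot(weights, values[nearest]) / (weights.sum() + 1e-12)

# Toy usage: three stored 2-D states with value estimates.
D = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
V = [0.0, 1.0, 0.5]
print(locally_weighted_value(np.array([0.2, 0.1]), D, V, k=2))
```

The same weighted-average form can be read either as nearest-neighbor locally weighted learning (contribution 1) or as a kernel expansion over a sparsified dictionary (contribution 2).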
(3) Many actor-critic (AC) algorithms have difficulty achieving good convergence speed and quality in continuous action spaces. An analysis of the limitations of the traditional Gaussian policy shows that this problem is mainly caused by the policy's lack of greediness. On this basis, an AC algorithm framework based on symmetric perturbation sampling is proposed for problems with a one-dimensional continuous action space. At each time step, the framework generates two candidate actions with symmetric Gaussian perturbations and takes them to interact with the environment in parallel. It then selects the behavior action greedily, updates the value-function parameters based on the larger of the two TD errors, and updates the policy parameters based on their average regular or natural policy gradient. The space and time complexities of the framework are analyzed, the convergence of all four AC algorithms integrated in the framework is proved, and the effectiveness of each new AC algorithm is demonstrated. However, this framework needs two interactions with the environment at each time step. To overcome this problem, an ε-greedy Gaussian policy and two matched compatible AC algorithm frameworks are also proposed. The policy combines the ε-greedy policy and the traditional Gaussian policy for the first time. At each time step, it generates 2N candidate actions by using symmetric perturbations in an N-dimensional continuous action space, and uses the ε-greedy rule to select the behavior action based on the advantage function. Finally, the performance of the policy and the compatible frameworks is analyzed theoretically and demonstrated experimentally.

(4) In reinforcement learning, many algorithms use a constant scalar step-size, which limits their achievable performance. To overcome this problem, an adaptive vector step-size is proposed, based on a re-interpretation of the RLSTD algorithm as second-order gradient descent, and is applied to the linear TD(0), Sarsa, and Q-learning algorithms. In deriving this step-size, a diagonal matrix replaces the original correlation matrix and a variable forgetting factor is introduced, which not only reduces the computational complexity but also adapts to the changing policy during learning. Theoretical analysis and simulation results show that the proposed step-size inherits the RLSTD algorithm's ability to reuse historical samples, efficiently improves the convergence quality of the above three algorithms, and makes the TD error at each time step gradually converge to 0 with probability 1.
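For the symmetric perturbation sampling and ε-greedy Gaussian policy of contribution (3), the sketch below shows how 2N candidate actions could be generated around the mean of a Gaussian policy in an N-dimensional action space and how an ε-greedy choice over them by advantage might look. The function names, the caller-supplied advantage estimate, and the exact perturbation scheme are assumptions for illustration, not the dissertation's precise frameworks.

```python
import numpy as np

def symmetric_candidates(mu, sigma, rng):
    """Generate 2N candidate actions by perturbing each of the N action
    dimensions symmetrically around the Gaussian policy mean mu."""
    mu = np.asarray(mu, dtype=float)
    deltas = np.abs(rng.normal(0.0, sigma, size=mu.shape))
    candidates = []
    for i, d in enumerate(deltas):
        for sign in (+1.0, -1.0):        # one candidate at mu + d*e_i, one at mu - d*e_i
            a = mu.copy()
            a[i] += sign * d
            candidates.append(a)
    return candidates

def epsilon_greedy_select(candidates, advantage_fn, epsilon, rng):
    """Pick the candidate with the largest estimated advantage with
    probability 1 - epsilon, otherwise pick uniformly at random."""
    if rng.random() < epsilon:
        return candidates[rng.integers(len(candidates))]
    scores = [advantage_fn(a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: 2-D action space with an advantage estimate peaked at the origin.
rng = np.random.default_rng(0)
cands = symmetric_candidates(mu=[0.3, -0.2], sigma=0.1, rng=rng)
print(epsilon_greedy_select(cands, lambda a: -np.sum(a ** 2), epsilon=0.1, rng=rng))
```

With N = 1 this reduces to the two symmetrically perturbed candidates used by the first framework; the greedy choice over candidates supplies the greediness that the plain Gaussian policy lacks.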
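For the adaptive vector step-size of contribution (4), the following minimal sketch of a linear TD(0) update illustrates the idea of replacing a constant scalar step-size with a per-feature step-size derived from a diagonal, forgetful approximation of the feature correlation matrix. The statistic `d`, the forgetting factor `beta`, and all names are assumptions used only for illustration, not the dissertation's exact derivation.

```python
import numpy as np

def td0_update(theta, d, phi_s, phi_next, reward, gamma=0.99, beta=0.999, eps=1e-6):
    """One linear TD(0) step with a per-feature (vector) step-size.

    d approximates the diagonal of the feature correlation matrix with a
    forgetting factor beta, so each weight gets its own adaptive step-size
    1 / d[i] instead of a shared constant scalar step-size.
    """
    d = beta * d + phi_s ** 2                                    # forgetful diagonal statistic
    td_error = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi_s)
    theta = theta + (td_error / (d + eps)) * phi_s               # element-wise vector step-size
    return theta, d

# Toy usage with 3 binary features.
theta, d = np.zeros(3), np.zeros(3)
phi_s, phi_next = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
theta, d = td0_update(theta, d, phi_s, phi_next, reward=1.0)
print(theta)
```

Keeping only the diagonal avoids the full-matrix cost of RLSTD while the forgetting factor lets the step-sizes track the changing policy during learning.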
Keywords/Search Tags:reinforcement learning, continuous space, function approximation, locally weighted learning, kernel recursive least squares, symmetric Gaussian perturbation, online sparsification, adaptive vector step-size