Policy evaluation and control learning are two main tasks in reinforcement learning (RL). The policy evaluation process estimates a value function that predicts the expected accumulated reward obtained under a fixed policy after visiting a particular state. In recent years, a large number of advanced TD-based algorithms with value function approximation (VFA) have been proposed. These improvements include higher efficiency, parameter regularization, generalization with eligibility traces, and off-policy learning. This thesis therefore focuses on efficient regularized online policy evaluation algorithms. The research content of this thesis consists of the following three parts:

(1) A novel policy evaluation algorithm called least squares temporal difference with gradient correction (LS-TDC) and its kernel-based version, kernel-based LS-TDC (KLS-TDC), are proposed. LS-TDC minimizes the mean-square projected Bellman error (MSPBE) within the framework of least squares TD augmented with gradient information, and thus obtains better convergence performance. For KLS-TDC, kernel approximate linear dependence analysis is performed to achieve kernel sparsification. In addition, a policy iteration strategy based on KLS-TDC is proposed for control learning problems. The convergence and parameter sensitivity of both LS-TDC and KLS-TDC are tested on on-policy learning, off-policy learning, and control learning benchmarks.

(2) Based on the recursive least squares technique, an O(n^2) counterpart of LS-TDC, termed RC (recursive least squares temporal difference with gradient correction), is then proposed. To further improve data efficiency, the RC algorithm is generalized with eligibility traces, and an off-policy extension based on importance sampling is also proposed. To give regularization a rigorous meaning in the objective function, an l2-regularized RC algorithm termed RRC (regularized RC) is proposed. An additional recursive step achieves an effect different from that of traditional recursive-least-squares-based l2-regularized algorithms: the regularization term does not decay during learning. Additionally, a fast counterpart with O(n^2) complexity, termed FRRC (fast RRC), is proposed, which is more practical as an online algorithm than RRC. Convergence analyses are given for LS-TDC, RC, RRC, and FRRC, and empirical results on both on-policy and off-policy benchmarks demonstrate that the proposed algorithms have substantial advantages.

(3) A sparse proximal recursive least squares temporal difference with gradient correction algorithm (termed l1-RC) is proposed, based on l1-regularization, to equip RC with a feature selection capability. l1-RC solves the nested optimization problem consisting of the MSPBE and an l1-penalty with complexity O(n^2) per time step. In this algorithm, RC with iterative refinement is used to solve the operator-error step, and ADMM with a proximal operator is used to solve the fixed-point-error step. These two steps are combined to form the proposed l1-RC, and several extensions are also given. The convergence analysis of l1-RC is based on the ODE method, and empirical results on benchmarks show that the proposed algorithm yields sparse solutions and outperforms state-of-the-art first-order algorithms.
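For reference, the objective at the core of LS-TDC, RC, and l1-RC is the MSPBE. A standard formulation under linear value function approximation $V_\theta = \Phi\theta$ (written in the conventional notation, which may differ from the thesis's own symbols) is

$$\mathrm{MSPBE}(\theta) \;=\; \lVert \Phi\theta - \Pi T^{\pi}\Phi\theta \rVert_D^2 \;=\; \mathbb{E}[\delta\phi]^{\top}\,\mathbb{E}[\phi\phi^{\top}]^{-1}\,\mathbb{E}[\delta\phi],$$

where $\Pi$ is the projection onto the span of the features weighted by the state distribution $D$, $T^{\pi}$ is the Bellman operator of the evaluated policy, $\phi$ and $\phi'$ are the feature vectors of the current and next state, and $\delta = r + \gamma\,\theta^{\top}\phi' - \theta^{\top}\phi$ is the TD error.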
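The gradient-correction idea underlying LS-TDC and RC can be illustrated with the classic stochastic two-timescale TDC update of Sutton et al.; the sketch below (plain NumPy, illustrative variable names) shows that standard rule only, not the thesis's least-squares or recursive variants.

    import numpy as np

    def tdc_update(theta, w, phi, phi_next, reward, gamma, alpha, beta):
        # TD error for the transition (phi, reward, phi_next) under weights theta.
        delta = reward + gamma * (phi_next @ theta) - phi @ theta
        # Main weights: TD step plus the gradient-correction term -gamma*phi'*(phi^T w).
        theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
        # Auxiliary weights, updated on a faster timescale, estimate
        # E[phi phi^T]^{-1} E[delta phi].
        w = w + beta * (delta - phi @ w) * phi
        return theta, w

The auxiliary vector w is what distinguishes gradient-correction methods from plain TD(0); the least-squares and recursive algorithms in this thesis estimate the same quantities in closed form or recursively rather than by stochastic approximation.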
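For the l1-regularized part, the proximal step used inside an ADMM iteration reduces, for an l1 penalty, to the standard soft-thresholding operator; a minimal sketch follows, where kappa is an illustrative name for the thresholding parameter (it is not a symbol taken from the thesis).

    import numpy as np

    def soft_threshold(z, kappa):
        # Proximal operator of kappa * ||.||_1: shrinks each coordinate toward zero
        # and sets coordinates with magnitude below kappa exactly to zero.
        return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

Driving small coordinates exactly to zero is what produces the sparse solutions and feature selection behavior attributed to l1-RC.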