
Research On Nonparametric Value Function Approximation Reinforcement Learning

Posted on: 2019-01-11    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T Ji    Full Text: PDF
GTID: 1368330545474339    Subject: Mechanical and electrical engineering
Abstract/Summary:
Value function approximation is one of the main approaches for applying classical reinforcement learning to large-scale and continuous state spaces, and it forms the research direction of value function approximation reinforcement learning. At present, however, it still suffers from slow convergence, heavy computational cost, and weak adaptive ability. In particular, the generalization structure and associated parameters of most value function approximation algorithms depend on prior knowledge or repeated trial and error, which makes the algorithms strongly domain-dependent: if the hand-crafted generalization bias does not match the problem, the algorithm cannot converge correctly and its usability is poor. To address these problems, and building on existing research results, this dissertation proposes a series of nonparametric value function approximation reinforcement learning algorithms.

(1) A nonparametric approximation policy iteration reinforcement learning algorithm based on CMAC (NPAPIRL-CMAC) is proposed. The algorithm designs a new CMAC-based reinforcement learning network structure and defines its working mechanism. Samples and the generalization parameter are obtained automatically by the FUNSample algorithm; the quantized coding structure is constructed automatically by the FUNT&E and FUNBI algorithms; the average learning rate is computed automatically from the sample sizes collected in the units of the quantized coding structure; the parameters of the value function approximator and the quantized coding structure are updated automatically by the delta rule; and the online computational capability of the whole algorithm is improved through generalized policy iteration (a generic tile-coding sketch is given after (2) below). Simulation results on the balancing control of a single inverted pendulum demonstrate the effectiveness, robustness, and rapid convergence of the proposed algorithm under different allowable error rates, with the activation interval of a single quantized coding structure unit set to 1 and 2 respectively.

(2) A nonparametric approximation generalized policy iteration reinforcement learning algorithm based on state clustering (NPAGPIRL-SC) is proposed. The algorithm improves the FRBF-based reinforcement learning network structure and defines its working mechanism. Samples are collected automatically by the FUNSample algorithm; the initial state basis functions and their adaptive adjustment parameters are constructed automatically by the FUNBase algorithm; the parameters of the value function approximator and the state basis functions are updated automatically by the delta rule; and the online computational capability of the whole algorithm is improved through generalized policy iteration (a generic state-clustering sketch is also given below). Simulation results on the balancing control of a single inverted pendulum demonstrate the effectiveness, robustness, and rapid convergence of the proposed algorithm under different discrete actions and different allowable error rates.
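To make the tile-coding idea behind contribution (1) concrete, the following is a minimal sketch of CMAC value function approximation with delta-rule updates on a toy one-dimensional regression task. It is an illustration only, not the dissertation's NPAPIRL-CMAC: the automatically constructed quantized coding structure (FUNT&E/FUNBI) and automatic sample collection (FUNSample) are replaced here by fixed, hand-chosen tilings and random sampling, and all parameter values are arbitrary.

```python
# Minimal CMAC (tile-coding) value approximation sketch; parameters are
# illustrative assumptions, not values from the dissertation.
import numpy as np

class CMAC:
    def __init__(self, n_tilings=8, n_tiles=10, low=0.0, high=1.0, alpha=0.1):
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.low = low
        self.tile_width = (high - low) / n_tiles
        # Each tiling is shifted by a fraction of one tile width.
        self.offsets = np.linspace(0.0, self.tile_width, n_tilings, endpoint=False)
        self.weights = np.zeros((n_tilings, n_tiles + 1))
        self.alpha = alpha / n_tilings  # learning rate shared across tilings

    def _active_tiles(self, s):
        # Index of the active tile in each tiling for scalar state s.
        idx = np.floor((s - self.low + self.offsets) / self.tile_width).astype(int)
        return np.clip(idx, 0, self.n_tiles)

    def value(self, s):
        return self.weights[np.arange(self.n_tilings), self._active_tiles(s)].sum()

    def update(self, s, target):
        # Delta rule: move each active weight toward the target.
        idx = self._active_tiles(s)
        err = target - self.value(s)
        self.weights[np.arange(self.n_tilings), idx] += self.alpha * err

# Usage: approximate V(s) = sin(2*pi*s) from noisy samples.
rng = np.random.default_rng(0)
cmac = CMAC()
for _ in range(5000):
    s = rng.uniform(0.0, 1.0)
    cmac.update(s, np.sin(2 * np.pi * s) + rng.normal(0, 0.1))
print(cmac.value(0.25))  # close to sin(pi/2) = 1
```

The overlapping shifted tilings give local generalization: each update touches only the weights of the tiles the state activates, which is what makes CMAC cheap to update online.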
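Similarly, the state-clustering idea in contribution (2) can be illustrated by a value approximator over Gaussian radial basis functions whose centres are spawned online whenever a sampled state lies far from all existing centres. The spawning threshold, kernel width, and learning rate below are hypothetical illustrative choices and do not reproduce the FUNBase procedure.

```python
# Minimal RBF value approximation with online centre spawning; thresholds and
# widths are illustrative assumptions, not the dissertation's FUNBase rules.
import numpy as np

class RBFValue:
    def __init__(self, sigma=0.1, new_center_dist=0.2, alpha=0.2):
        self.centers = np.empty((0, 1))  # grown online from observed states
        self.weights = np.empty(0)
        self.sigma, self.thresh, self.alpha = sigma, new_center_dist, alpha

    def _phi(self, s):
        # Gaussian activations of all basis functions for state s.
        d = np.linalg.norm(self.centers - s, axis=1)
        return np.exp(-(d ** 2) / (2 * self.sigma ** 2))

    def _maybe_add_center(self, s):
        # Spawn a new basis function when the state is far from every
        # existing centre (Euclidean distance criterion).
        d = np.linalg.norm(self.centers - s, axis=1) if len(self.centers) else [np.inf]
        if np.min(d) > self.thresh:
            self.centers = np.vstack([self.centers, s])
            self.weights = np.append(self.weights, 0.0)

    def value(self, s):
        phi = self._phi(s)
        z = phi.sum()
        return float(phi @ self.weights / z) if z > 0 else 0.0

    def update(self, s, target):
        self._maybe_add_center(s)
        phi = self._phi(s)
        phi = phi / phi.sum()  # normalized activations
        self.weights += self.alpha * (target - self.value(s)) * phi  # delta rule

rng = np.random.default_rng(1)
v = RBFValue()
for _ in range(5000):
    s = rng.uniform(0.0, 1.0, size=(1,))
    v.update(s, float(np.cos(3 * s[0])))
print(len(v.centers), v.value(np.array([0.5])))  # roughly cos(1.5) ~ 0.07
```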
(3) A nonparametric approximation policy iteration parallel reinforcement learning algorithm (NPAPIRL-P) is proposed. The algorithm designs a new parallel reinforcement learning network structure and the corresponding parallel learning methods; each single learning unit is implemented with the NPAGPIRL-SC algorithm (a generic parallel-pattern sketch is given below). Simulation results on the balancing control of a single inverted pendulum demonstrate the effectiveness and robustness of the proposed algorithm under different discrete actions and different allowable error rates. The simulations also examine the algorithm's ability to balance speed-up ratio against parallel efficiency, and the results show that NPAPIRL-P achieves good parallel acceleration compared with the experimental data of the NPAGPIRL-SC algorithm.

(4) A nonparametric approximation policy iteration reinforcement learning algorithm based on the Dyna framework (NPAPIRL-Dyna) is proposed. It improves on the NPAGPIRL-SC algorithm in the following respects. First, an internal state transition matrix D is introduced into the network structure, exploiting the successive features of time to compensate for determining the membership of an input state in each state basis function from spatial Euclidean distance alone. Second, the topological features of the environment are described by the visit frequencies of the state basis functions, the environment estimation models B and B′ are constructed, and the learning and planning processes are integrated organically through Dyna-framework model identification (a generic Dyna-style sketch is given below). Third, the adaptive adjustment ability of the algorithm's network structure and parameters is further strengthened, including operations for adding structure, merging structure, adjusting parameters, and so on. Simulation results on the balancing control of a single inverted pendulum demonstrate the effectiveness and robustness of the proposed algorithm under different allowable error rates, and comparison with the NPAGPIRL-SC algorithm shows that the model-based planning process helps to improve the efficiency and accuracy of the algorithm.
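The parallel structure of contribution (3) can be sketched generically: several independent learning units each process their own experience, and a coordinator merges their estimates. The merge-by-averaging rule and the 19-state random walk below are illustrative assumptions standing in for NPAGPIRL-SC learning units and the inverted pendulum task; they are not the dissertation's parallel method.

```python
# Minimal parallel-learning pattern: independent units + a merging coordinator.
# The averaging merge and the random-walk task are illustrative assumptions.
import numpy as np
from multiprocessing import Pool

N_STATES = 19  # random walk; terminals give reward -1 (left) and +1 (right)

def learn_unit(seed, episodes=2000, alpha=0.1):
    # One learning unit: tabular TD(0) value estimation on the random walk.
    rng = np.random.default_rng(seed)
    v = np.zeros(N_STATES)
    for _ in range(episodes):
        s = N_STATES // 2
        while True:
            s2 = s + rng.choice([-1, 1])
            if s2 < 0:
                v[s] += alpha * (-1.0 - v[s]); break
            if s2 >= N_STATES:
                v[s] += alpha * (1.0 - v[s]); break
            v[s] += alpha * (v[s2] - v[s])
            s = s2
    return v

if __name__ == "__main__":
    with Pool(4) as pool:
        estimates = pool.map(learn_unit, range(4))  # 4 units run in parallel
    v = np.mean(estimates, axis=0)                  # coordinator merges
    print(v[::4])  # true V rises linearly: V(0) ~ -0.9, V(18) ~ +0.9
```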
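Finally, the learning-and-planning integration of contribution (4) follows the general Dyna pattern: each real transition updates the value function directly and is also recorded in an environment model, from which extra simulated updates are drawn. The sketch below is textbook Dyna-Q on a small deterministic gridworld; it omits the dissertation's transition matrix D, the models B and B′, and the structure-adaptation operations.

```python
# Textbook Dyna-Q sketch: direct RL updates interleaved with planning updates
# sampled from a learned model. Gridworld and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(2)
W, H, GOAL = 6, 4, (5, 3)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
Q = np.zeros((W, H, 4))
model = {}  # (state, action) -> (reward, next_state)

def step(s, a):
    nx = min(max(s[0] + ACTIONS[a][0], 0), W - 1)
    ny = min(max(s[1] + ACTIONS[a][1], 0), H - 1)
    s2 = (nx, ny)
    return (1.0 if s2 == GOAL else 0.0), s2

for episode in range(50):
    s = (0, 0)
    while s != GOAL:
        # epsilon-greedy action selection
        a = rng.integers(4) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        r, s2 = step(s, a)
        # direct RL: one-step Q-learning update from real experience
        Q[s][a] += 0.1 * (r + 0.95 * np.max(Q[s2]) - Q[s][a])
        model[(s, a)] = (r, s2)  # record the transition in the model
        # planning: replay simulated transitions sampled from the model
        for _ in range(10):
            (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
            Q[ps][pa] += 0.1 * (pr + 0.95 * np.max(Q[ps2]) - Q[ps][pa])
        s = s2

print(np.argmax(Q[(0, 0)]))  # greedy action at the start state
```

The planning loop is what lets model-based updates propagate value information faster than real experience alone, which is consistent with the abstract's observation that planning improves efficiency and accuracy.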
Keywords/Search Tags: reinforcement learning, value function approximation, nonparametric, policy iteration, CMAC, Dyna framework, inverted pendulum