
Analysis And Research On Off-policy Algorithms In Reinforcement Learning

Posted on: 2015-01-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q M Fu    Full Text: PDF
GTID: 1268330428498160    Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is a learning method in which an agent interacts with the environment in order to find an optimal policy that maximizes the expected accumulated reward. Depending on whether the behavior policy and the target policy coincide during learning, reinforcement learning algorithms fall into two main classes: on-policy algorithms and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms offer a much wider range of application, and research on off-policy algorithms has become increasingly popular. To address the main problems of off-policy algorithms, such as non-convergence, slow convergence and low convergence accuracy, this thesis provides a series of solutions in the following four parts:

(1) A novel off-policy Q(λ) algorithm based on linear function approximation. The algorithm introduces an associated importance factor and uses it to unify the on-policy and off-policy sample data distributions during iteration, which ensures convergence. Under the premise of sample data consistency, the thesis proves the convergence of the algorithm. (An illustrative sketch of this kind of importance-weighted off-policy update is given after this abstract.)

(2) Starting from the TD error, the thesis defines the N-order TD error, applies it to the traditional Q(λ) algorithm, and puts forward a fast Q(λ) algorithm based on the second-order TD error. The algorithm adjusts the Q value with the second-order TD error and broadcasts the TD error over the whole state-action space, which speeds up convergence. In addition, the thesis analyzes the convergence rate; under one-step updates, the result shows that the number of iterations mainly depends on 1/(1-γ) and 1/ε.

(3) Value function transfer between similar learning tasks that share the same state space and action space, which reduces the number of samples needed in the target task and speeds up convergence. Based on the off-policy Q-learning framework, combined with the value function transfer method, the thesis puts forward a novel fast Q-learning algorithm based on value function transfer, VFT-Q-Learning. The algorithm first uses a bisimulation metric to measure the distance between states of the target task and a historical task (on the condition that the two tasks share the same state and action spaces), transfers the value function when the distance satisfies a given condition, and then executes the learning algorithm (see the transfer sketch after this abstract).

(4) To balance exploration and exploitation in large or continuous state spaces, the thesis puts forward a novel off-policy approximate policy iteration algorithm based on Gaussian processes. The algorithm models the action-value function with a Gaussian process, combines it with the associated importance factor to construct a generative model, and obtains the posterior distribution of the parameter vector of the action-value function by Bayesian inference. During learning, the algorithm computes the value of perfect information from the posterior distribution and, together with the expected value of the action-value function, selects an appropriate action. To a certain extent, the algorithm balances exploration and exploitation during learning and accelerates convergence.
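For part (1), the abstract does not give the exact form of the associated importance factor, but the mechanism it describes — reweighting behavior-policy samples toward the target-policy distribution — can be illustrated with a standard importance-sampling ratio in an off-policy TD update with linear function approximation. The sketch below is a minimal illustration under that assumption, not the thesis's algorithm; the function name, the ratio rho = pi(a|s)/mu(a|s), and all parameter values are hypothetical.

```python
import numpy as np

def off_policy_td_update(w, phi_s, phi_s_next, reward, rho,
                         alpha=0.1, gamma=0.95):
    """One importance-weighted off-policy TD(0) update with linear
    function approximation (illustrative sketch only).

    w          -- weight vector of the linear value approximation
    phi_s      -- feature vector of the current state(-action)
    phi_s_next -- feature vector of the successor state(-action)
    rho        -- importance factor pi(a|s) / mu(a|s), reweighting a
                  behavior-policy sample toward the target policy
    """
    td_error = reward + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + alpha * rho * td_error * phi_s
```

With rho = 1 the update reduces to the ordinary on-policy case, which is the sense in which an importance factor can unify the on-policy and off-policy sample distributions.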
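For part (3), the abstract only states that values are transferred when a bisimulation-metric distance between target and historical states is small enough. The tabular sketch below illustrates that idea under the assumption that the distance matrix has already been computed; the names transfer_value_function and threshold are placeholders, not the thesis's notation.

```python
import numpy as np

def transfer_value_function(q_source, distance, threshold):
    """Initialise a target task's Q-table from a historical task
    (illustrative sketch of value function transfer; both tasks are
    assumed to share the same action space).

    q_source  -- Q-table of the historical task, shape (n_source, n_actions)
    distance  -- precomputed bisimulation-style distances,
                 shape (n_target, n_source)
    threshold -- transfer only when the nearest source state is this close
    """
    n_target = distance.shape[0]
    q_target = np.zeros((n_target, q_source.shape[1]))
    for t in range(n_target):
        s = int(np.argmin(distance[t]))        # closest historical state
        if distance[t, s] <= threshold:
            q_target[t] = q_source[s]          # warm-start from its values
    return q_target                            # other entries learned from scratch
```

Ordinary Q-learning then starts from q_target instead of an all-zero table, which is how transfer can reduce the number of samples needed in the target task.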
Keywords/Search Tags: Reinforcement Learning, Off-policy, Function approximation, Bisimulation metric, Value function transfer, Policy iteration, Bayesian inference