
Analysis And Research On Off-policy Algorithms In Reinforcement Learning

Posted on: 2015-01-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q M Fu    Full Text: PDF
GTID: 1268330428498160    Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is a learning method in which an agent interacts with the environment in order to find an optimal policy that maximizes the expected accumulated reward. Depending on whether the behavior policy and the target policy coincide during learning, reinforcement learning algorithms fall into two main classes: on-policy algorithms and off-policy algorithms. Compared with on-policy algorithms, off-policy algorithms offer a much wider range of application, and research on off-policy algorithms has become increasingly popular. To address the main problems of off-policy algorithms, such as non-convergence, slow convergence and low convergence accuracy, this thesis provides a series of solutions in the following four parts:

(1) A novel off-policy Q(λ) algorithm based on linear function approximation. The algorithm introduces an associated importance factor and uses it to unify the on-policy and off-policy sample data distributions during iteration, which ensures convergence. Under the premise of sample data consistency, the thesis proves the convergence of the algorithm. (An illustrative sketch of this kind of importance-weighted off-policy update is given after this abstract.)

(2) Starting from the TD error, the thesis defines the N-order TD error, applies it to the traditional Q(λ) algorithm, and puts forward a fast Q(λ) algorithm based on the second-order TD error. The algorithm adjusts the Q value with the second-order TD error and broadcasts the TD error over the whole state-action space, which speeds up convergence. In addition, the thesis analyzes the convergence rate; under one-step updates, the result shows that the number of iterations mainly depends on 1/(1-γ) and 1/ε.

(3) Value function transfer between similar learning tasks that share the same state space and action space, which reduces the number of samples needed in the target task and speeds up convergence. Based on the off-policy Q-learning framework, combined with the value function transfer method, the thesis puts forward a novel fast Q-learning algorithm based on value function transfer, VFT-Q-Learning. The algorithm first uses a bisimulation metric to measure the distance between states of the target task and a historical task (on the condition that the two tasks share the same state and action spaces), transfers the value function when the distance satisfies a given condition, and then executes the learning algorithm (see the transfer sketch after this abstract).

(4) To balance exploration and exploitation in large or continuous state spaces, the thesis puts forward a novel off-policy approximate policy iteration algorithm based on Gaussian processes. The algorithm models the action-value function with a Gaussian process, combines it with the associated importance factor to construct a generative model, and obtains the posterior distribution of the parameter vector of the action-value function by Bayesian inference. During learning, the algorithm computes the value of perfect information from the posterior distribution and, together with the expected value of the action-value function, selects an appropriate action. To a certain extent, the algorithm balances exploration and exploitation during learning and accelerates convergence.
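For part (1), the abstract does not give the exact form of the associated importance factor, but the mechanism it describes — reweighting behavior-policy samples toward the target-policy distribution — can be illustrated with a standard importance-sampling ratio in an off-policy TD update with linear function approximation. The sketch below is a minimal illustration under that assumption, not the thesis's algorithm; the function name, the ratio rho = pi(a|s)/mu(a|s), and all parameter values are hypothetical.

```python
import numpy as np

def off_policy_td_update(w, phi_s, phi_s_next, reward, rho,
                         alpha=0.1, gamma=0.95):
    """One importance-weighted off-policy TD(0) update with linear
    function approximation (illustrative sketch only).

    w          -- weight vector of the linear value approximation
    phi_s      -- feature vector of the current state(-action)
    phi_s_next -- feature vector of the successor state(-action)
    rho        -- importance factor pi(a|s) / mu(a|s), reweighting a
                  behavior-policy sample toward the target policy
    """
    td_error = reward + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + alpha * rho * td_error * phi_s
```

With rho = 1 the update reduces to the ordinary on-policy case, which is the sense in which an importance factor can unify the on-policy and off-policy sample distributions.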
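For part (3), the abstract only states that values are transferred when a bisimulation-metric distance between target and historical states is small enough. The tabular sketch below illustrates that idea under the assumption that the distance matrix has already been computed; the names transfer_value_function and threshold are placeholders, not the thesis's notation.

```python
import numpy as np

def transfer_value_function(q_source, distance, threshold):
    """Initialise a target task's Q-table from a historical task
    (illustrative sketch of value function transfer; both tasks are
    assumed to share the same action space).

    q_source  -- Q-table of the historical task, shape (n_source, n_actions)
    distance  -- precomputed bisimulation-style distances,
                 shape (n_target, n_source)
    threshold -- transfer only when the nearest source state is this close
    """
    n_target = distance.shape[0]
    q_target = np.zeros((n_target, q_source.shape[1]))
    for t in range(n_target):
        s = int(np.argmin(distance[t]))        # closest historical state
        if distance[t, s] <= threshold:
            q_target[t] = q_source[s]          # warm-start from its values
    return q_target                            # other entries learned from scratch
```

Ordinary Q-learning then starts from q_target instead of an all-zero table, which is how transfer can reduce the number of samples needed in the target task.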
Keywords/Search Tags: Reinforcement Learning, Off-policy, Function approximation, Bisimulation metric, Value function transfer, Policy iteration, Bayesian inference