
Research On Policy Iteration Algorithm Within Bayesian Reinforcement Learning

Posted on: 2017-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: S H You
Full Text: PDF
GTID: 2308330488461934
Subject: Software engineering
Abstract/Summary:
Bayesian reinforcement learning uses Bayesian techniques to place probability distributions over value functions, policies, and environment models, and has been applied to a wide range of reinforcement learning tasks. Its core idea is to express uncertainty through a prior distribution over unknown parameters and to learn by computing the posterior distribution given the observed data. Within the framework of policy iteration, this thesis proposes three improved reinforcement learning algorithms built on Bayesian inference. The main contributions are as follows:

i. To address the inability of traditional Bayesian reinforcement learning methods to control the number of model-learning steps dynamically when the environment model is unknown, this thesis proposes an improved policy iteration algorithm based on Bayesian intelligent model learning. For model learning, the algorithm uses the variance of the Dirichlet posterior to decide when to learn and relearn the model. For policy learning, it selects actions according to the current mean estimate of the state transitions under the estimated model, plus an additional reward bonus that ensures exploration; these exploration actions drive the agent to traverse every state-action pair during model learning. Model learning and policy learning work together so that the algorithm converges to a near-optimal policy (see the first sketch after this summary).

ii. To address the inefficiency of traditional reinforcement learning methods in handling the exploration/exploitation tradeoff, this thesis proposes an asynchronous policy iteration algorithm based on probabilistic estimates of the action-value functions (Q-values). For policy evaluation, the algorithm uses Gaussian-gamma distributions to model the Q-values and computes their posterior distributions from the observed data. For policy improvement, it computes the value of perfect information (VPI) from the posterior and selects actions that balance exploration against exploitation (see the second sketch below). Finally, the algorithm computes the action-value function of the optimal policy by asynchronous policy iteration, which improves the convergence rate.

iii. To address the inability of traditional policy iteration methods to solve MDPs with continuous state spaces and unknown environment models efficiently, this thesis presents an online policy iteration algorithm based on Gaussian processes and temporal differences. The algorithm models the action-value function with a Gaussian process combined with the temporal-difference formula, and obtains the posterior distribution of the action-value function by Bayesian inference (see the third sketch below). Because it is an online method, it can evaluate the improved policy immediately, without delay. To some extent, the algorithm can complete reinforcement learning tasks in continuous state spaces, and it converges faster.
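As an illustration of the first algorithm's model-learning test, the following Python sketch maintains a Dirichlet posterior over state transitions, uses the component variances of that posterior to decide when to keep learning the model, and adds a count-based reward bonus for exploration. All names (DirichletModel, should_relearn, bonus_reward) and the specific variance threshold are illustrative assumptions, not the thesis's actual implementation.

import numpy as np

class DirichletModel:
    def __init__(self, n_states, n_actions, prior=1.0):
        # One Dirichlet per (s, a) over successor states; `prior` is a
        # symmetric concentration parameter (an assumed default).
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Posterior update from one observed transition.
        self.alpha[s, a, s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean estimate of P(s' | s, a), used for action selection.
        return self.alpha[s, a] / self.alpha[s, a].sum()

    def max_variance(self, s, a):
        # Component variance of the Dirichlet posterior:
        # Var[p_i] = a_i (a_0 - a_i) / (a_0^2 (a_0 + 1))
        a_vec = self.alpha[s, a]
        a0 = a_vec.sum()
        var = a_vec * (a0 - a_vec) / (a0 ** 2 * (a0 + 1.0))
        return var.max()

    def should_relearn(self, s, a, threshold=1e-3):
        # Keep (re)learning the model only while the posterior is still
        # uncertain, so the number of model-learning steps adapts to the data.
        return self.max_variance(s, a) > threshold

def bonus_reward(visit_count, beta=1.0):
    # Exploration bonus added to the estimated reward; it shrinks with
    # visits, so well-explored pairs stop attracting the agent.
    return beta / np.sqrt(visit_count + 1.0)

Tying the relearning decision to the posterior variance is what lets the number of model-learning steps vary with the data rather than being fixed in advance.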
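For the second algorithm, the sketch below shows one way to realize VPI-based action selection under a Gaussian-gamma (normal-gamma) posterior over each Q-value's unknown mean. Parameter names follow the usual NormalGamma(mu, lam, alpha, beta) convention, and the Monte Carlo estimate of VPI is an assumption made for brevity; a closed-form expression via the Student-t marginal is also possible.

import numpy as np

rng = np.random.default_rng(0)

def sample_q_mean(mu, lam, alpha, beta, n=2000):
    # Draw samples of the mean of Q from its normal-gamma posterior:
    # tau ~ Gamma(alpha, rate=beta), then mean ~ Normal(mu, 1 / (lam * tau)).
    tau = rng.gamma(alpha, 1.0 / beta, size=n)
    return rng.normal(mu, 1.0 / np.sqrt(lam * tau))

def vpi(posteriors):
    # posteriors: list of (mu, lam, alpha, beta), one tuple per action.
    means = np.array([p[0] for p in posteriors])
    order = np.argsort(means)
    best, second = order[-1], order[-2]
    q1, q2 = means[best], means[second]
    out = np.empty(len(posteriors))
    for a, p in enumerate(posteriors):
        s = sample_q_mean(*p)
        if a == best:
            # Gain if the presumed-best action is actually worse than q2.
            out[a] = np.maximum(q2 - s, 0.0).mean()
        else:
            # Gain if this action turns out better than the current best.
            out[a] = np.maximum(s - q1, 0.0).mean()
    return out

def select_action(posteriors):
    # Balance exploitation (posterior mean) against exploration (VPI).
    means = np.array([p[0] for p in posteriors])
    return int(np.argmax(means + vpi(posteriors)))

Because VPI is large exactly when learning the true Q-value could change the policy, adding it to the posterior mean directs exploration toward informative actions rather than uniformly random ones.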
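For the third algorithm, the sketch below follows the standard Gaussian-process temporal-difference (GPTD) model r_t = Q(x_t) - gamma * Q(x_{t+1}) + noise and computes the Gaussian-process posterior over the action-value function by batch GP regression. The RBF kernel and the batch (non-sparse) formulation are assumptions for compactness; an online variant, as the thesis describes, would update these matrices incrementally after each transition.

import numpy as np

def rbf(X, Y, ell=1.0):
    # Squared-exponential kernel over state-action feature vectors.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gptd_posterior(X, rewards, gamma=0.95, sigma=0.1):
    # X: (T+1, d) visited state-action features; rewards: (T,) observations.
    # GPTD model: r = H q + noise, where H encodes the temporal differences.
    T = len(rewards)
    H = np.zeros((T, T + 1))
    H[np.arange(T), np.arange(T)] = 1.0
    H[np.arange(T), np.arange(T) + 1] = -gamma
    K = rbf(X, X)
    G = H @ K @ H.T + sigma ** 2 * np.eye(T)
    w = np.linalg.solve(G, rewards)

    def predict(x_star):
        # Posterior mean and variance of Q at new state-action points.
        k = rbf(np.atleast_2d(x_star), X)          # (m, T+1)
        mean = k @ H.T @ w
        Hk = H @ k.T                               # (T, m)
        var = rbf(np.atleast_2d(x_star), np.atleast_2d(x_star)).diagonal() \
              - (Hk * np.linalg.solve(G, Hk)).sum(0)
        return mean, var

    return predict

Because the posterior is available in closed form after every transition, the improved policy can be evaluated immediately, which is the property the online algorithm exploits.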
Keywords/Search Tags: Bayesian reinforcement learning, policy iteration, model learning, Gaussian-gamma distribution, Gaussian process