
Research On Maximization Bias Corrected Off-Policy Algorithms In Reinforcement Learning

Posted on: 2020-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Hu
Full Text: PDF
GTID: 2428330578480896
Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. An agent gains reward signals by interacting with the environment in order to find the optimal policy, i.e., the policy with maximal cumulative reward. According to whether the target policy and the behavior policy are the same, reinforcement learning algorithms can be divided into on-policy and off-policy algorithms. Off-policy methods evaluate or improve a policy that differs from the policy used to generate the data; they are fast to compute, easy to implement, and have been used in a wide range of applications.

Q-Learning is a popular off-policy temporal-difference control algorithm. In some stochastic environments, such as optimal control problems with highly random rewards and a high discount factor, Q-Learning suffers from high levels of statistical error. The cause of this phenomenon is the positive bias introduced by using the maximum of the estimated values as an estimate of the true maximum value. This positive bias is called maximization bias; it degrades the quality of the behavior learned by the agent and slows the convergence of the algorithm. Aiming at these problems, this paper proposes three different off-policy algorithms that correct maximization bias. The main research includes the following three parts:

i. Research on an off-policy maximization-bias-corrected algorithm based on a generalized form of Q-Learning. To address the maximization bias that Q-Learning produces when solving optimal control problems, the paper presents an accumulated form of the Q-Learning update rule. From a practical point of view, the paper demonstrates how maximization bias arises and analyzes why positive bias degrades the performance of Q-Learning. We generalize the new form so that it is easy to adapt, and on this basis present a new off-policy maximization-bias-corrected algorithm. By using the current maximal value instead of bias terms to correct the action-value function, we construct a new estimator. As a result, the influence of the overestimation is reduced, and both the convergence rate and the accuracy of the value function are improved.

ii. Research on an eligibility-traces-oriented off-policy maximization-bias-corrected algorithm. Using eligibility traces to solve temporal credit assignment in tasks with large discrete state-action spaces exacerbates maximization bias. To address this, the paper presents an eligibility-traces-oriented off-policy maximization-bias-corrected algorithm. The algorithm uses the more accurate TD error obtained from the improved estimator and broadcasts the current estimate through the eligibility traces to the entire value-function space. This improves data utilization and accelerates the correction of the value function, thereby improving the convergence performance of the algorithm.

iii. Research on a Dyna-structure-oriented off-policy maximization-bias-corrected algorithm. In some complex tasks where an environment model is available, the Dyna-Q algorithm tightly integrates the learning process and the planning process to improve data utilization. However, in complex discrete state spaces with highly random rewards, Dyna-Q produces maximization bias, which leads to slow convergence and difficulty adapting to a changing environment. The paper presents a Dyna-structure-oriented off-policy maximization-bias-corrected algorithm that optimizes the estimator used in the value-function update, reducing the maximization bias generated during both the learning process and the planning process. This further improves the convergence rate of the algorithm, so that it can quickly adapt to changes in the environment model.
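The maximization bias described above — the expectation of the maximum of noisy value estimates overestimating the maximum of the true values — can be reproduced with a short simulation. This is a minimal illustration of the phenomenon, not code from the thesis; all names and parameters are hypothetical:

```python
import random

# True action values: every action is equally good (value 0),
# but rewards are noisy, so sample-mean estimates fluctuate around 0.
NUM_ACTIONS = 10
NUM_SAMPLES = 5      # few samples per action -> noisy estimates
NUM_TRIALS = 10000

random.seed(0)
total_max_estimate = 0.0
for _ in range(NUM_TRIALS):
    # Estimate each action's value from a handful of noisy rewards.
    estimates = [
        sum(random.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)) / NUM_SAMPLES
        for _ in range(NUM_ACTIONS)
    ]
    total_max_estimate += max(estimates)

avg_max = total_max_estimate / NUM_TRIALS
# The true maximum over actions is 0, yet the average maximal
# estimate is clearly positive: this gap is the maximization bias.
print(f"average max estimate: {avg_max:.3f} (true maximum is 0.0)")
```

The fewer the samples per action (or the noisier the rewards), the larger the gap, which mirrors the claim that highly random rewards aggravate the bias.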
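For context on part i, the standard tabular Q-Learning update whose max operator introduces the bias can be contrasted with the well-known double-estimator correction (double Q-learning), which decouples action selection from action evaluation. This is a generic textbook-style sketch for comparison, not the generalized estimator proposed in the thesis; the state/action names are hypothetical:

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
ACTIONS = [0, 1, 2]

def q_learning_update(Q, s, a, r, s_next):
    # Standard Q-Learning: the same estimator both selects AND
    # evaluates the next action, which causes maximization bias.
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def double_q_update(QA, QB, s, a, r, s_next):
    # Double Q-learning: one table picks the greedy action, the
    # other table evaluates it, decoupling the two roles.
    if random.random() < 0.5:
        QA, QB = QB, QA
    best = max(ACTIONS, key=lambda b: QA[(s_next, b)])
    target = r + GAMMA * QB[(s_next, best)]
    QA[(s, a)] += ALPHA * (target - QA[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, "s0", 0, 1.0, "s1")
# One step toward target r + gamma * 0 = 1.0 moves Q by alpha.
print(Q[("s0", 0)])  # -> 0.1
```

The thesis instead corrects the single estimator directly (using the current maximal value in place of bias terms), but the comparison shows where in the update rule any such correction must act.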
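The broadcasting mechanism in part ii — one TD error propagated to all recently visited state-action pairs — is the standard eligibility-trace machinery. A minimal Watkins's Q(λ)-style sketch of that mechanism follows (generic textbook form with hypothetical names, not the thesis's corrected algorithm, whose improvement lies in computing a more accurate TD error):

```python
from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.1, 0.99, 0.9
ACTIONS = [0, 1]

def q_lambda_update(Q, E, s, a, r, s_next, a_next):
    # One TD error, computed once at the current step.
    greedy_value = max(Q[(s_next, b)] for b in ACTIONS)
    delta = r + GAMMA * greedy_value - Q[(s, a)]
    E[(s, a)] = 1.0  # replacing trace for the visited pair
    a_was_greedy = Q[(s_next, a_next)] == greedy_value
    for key in list(E):
        # Broadcast the single TD error along the eligibility traces,
        # so the whole recently visited region is corrected at once.
        Q[key] += ALPHA * delta * E[key]
        # Watkins's Q(lambda): decay traces; cut them after exploration.
        E[key] = GAMMA * LAMBDA * E[key] if a_was_greedy else 0.0

Q, E = defaultdict(float), defaultdict(float)
q_lambda_update(Q, E, "s0", 0, 1.0, "s1", 0)
```

Because every broadcast reuses the same delta, a biased TD error contaminates many entries at once — which is why the thesis pairs the traces with an improved, bias-corrected estimator.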
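Finally, the Dyna structure in part iii interleaves direct reinforcement-learning updates with planning updates drawn from a learned model. A minimal Dyna-Q step in its standard textbook form (again with hypothetical names, not the corrected variant proposed in the thesis) might look like:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 5
ACTIONS = [0, 1]

def dyna_q_step(Q, model, s, a, r, s_next):
    # (1) Direct RL: ordinary Q-learning update from real experience.
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    # (2) Model learning: remember the observed transition.
    model[(s, a)] = (r, s_next)
    # (3) Planning: replay simulated transitions from the model.
    for _ in range(N_PLANNING):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        ptarget = pr + GAMMA * max(Q[(ps_next, b)] for b in ACTIONS)
        Q[(ps, pa)] += ALPHA * (ptarget - Q[(ps, pa)])

Q, model = defaultdict(float), {}
dyna_q_step(Q, model, "s0", 0, 1.0, "s1")
```

Note that the max operator appears in both the direct update (1) and every planning update (3): the bias is injected N_PLANNING + 1 times per real step, which is why the thesis corrects the estimator in both the learning and the planning phase.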
Keywords/Search Tags: Reinforcement learning, Off-policy, Q-Learning, Maximization bias