
Research On Reinforcement Learning Methods Based On Bias-correction Of Value Function Estimation

Posted on: 2020-12-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P L Lv
Full Text: PDF
GTID: 1368330623456034
Subject: Control theory and control engineering
Abstract/Summary:
Reinforcement learning is an important method for solving Markov decision process problems. Research on reinforcement learning has produced abundant results; in particular, since the emergence of deep reinforcement learning, reinforcement learning has been applied successfully in many fields. One important branch is value-function-based reinforcement learning, from which a large number of classical algorithms, such as the deep Q network, have emerged. Iteratively solving for the action value function always involves estimating the maximum expected action value, and this estimation carries bias. The same problem arises in other areas of machine learning, so accurate estimation of the maximum expected value is a very important problem. This dissertation studies bias correction of value function estimation. The main contents are as follows:

(1) Addressing the overestimation of the Q-learning algorithm and the underestimation of the double Q-learning algorithm, we analyze the causes of estimation bias in the maximum expected value and propose corresponding bias-correction methods. First, we define the order estimator and study its estimation bias; every existing method for estimating the maximum expected value can be regarded as a combination of order estimators. On this basis, the reasons for the overestimation and underestimation of the various estimation methods are analyzed, focusing on the advantages and disadvantages of the maximum estimator and the double estimator. We conclude that updating the value function with a single order estimator leads to estimation bias of varying degrees, while a controlled random combination of multiple order estimators can effectively correct the bias. This part is the theoretical basis of the subsequent research.

(2) To construct an unbiased estimator, a bias-correction reinforcement learning method based on an integrated double estimator is proposed from the viewpoint of randomness and controllability. First, an integrated double estimator is designed, and it is proved theoretically that suitable parameters exist that make the estimator unbiased. Second, the integrated double estimator is used to update the reinforcement learning value function, yielding the integrated double Q-learning algorithm and the integrated double deep Q network algorithm. The unbiasedness and convergence of integrated double Q-learning are proved theoretically. The new algorithms are based on a double-estimation framework with stochastic characteristics: maximum estimation is used to evaluate actions, while double estimation is added in a controlled manner to avoid the overestimation or underestimation that arises from a single estimator.

(3) Focusing on the estimation bias caused by a "designated" value function, and starting from stochastic selection, the estimation of the maximum expected action value is regarded as a "selection" problem among estimators, and a bias-correction reinforcement learning method based on a stochastically selected estimation strategy is proposed. First, a stochastic selection estimator is designed and its unbiasedness is proved theoretically. Second, the new estimator is applied to algorithm design, yielding double Q-learning and double deep Q-learning based on the stochastically selected estimation strategy. The key parameters of the new algorithms are studied, and parameter formulas are derived for the cases of known and unknown expectation. Finally, from the viewpoint of random selection over episodes, double deep Q reinforcement learning based on an episodic stochastically selected estimation strategy is proposed and simulated.

(4) Exploration versus exploitation of actions has always been a key problem in reinforcement learning: an agent should not only exploit the action with maximum value but also explore potentially optimal actions. Our analysis shows that the uncertainty of value function estimation is similar to that of action selection: one should not only use the current optimal value function but also explore unknown value functions to correct the estimation bias. Inspired by the exploration and exploitation of actions, estimating the maximum expected value is therefore transformed into the problem of effectively exploring the value function. From this new viewpoint, a Bayesian deep Q network based on a value-function exploration reward is proposed. To capture the uncertainty of the value function, an exploration bonus is constructed by applying Bayesian linear regression to the last layer of the deep Q network. Adding this bonus to the original value function yields a new value function with exploratory characteristics. When estimating the maximum expected value, the new value function is used to select the action, and the original value function is used to evaluate it. The proposed algorithm combines action exploration with value-function exploration, effectively balancing the estimation bias.

(5) In the Bayesian deep Q network, the action used to compute the target value comes from random sampling of the posterior distribution, which makes the target value fluctuate greatly. To stabilize the Bayesian deep Q network, the integrated double estimator and the stochastic selection strategy are applied to the computation of the target value, and the posterior mean is used to calculate it, improving its stability. An integrated Bayesian deep Q network based on the integrated double estimator and a stochastic Bayesian deep Q network based on the stochastic selection strategy are proposed respectively.

Simulation results on grid worlds and Atari games show that the proposed algorithms effectively eliminate the estimation bias of the value function, improve learning performance, and stabilize the learning process. This dissertation has 28 figures, 5 tables and 114 references.
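The opposing biases discussed above can be reproduced in a toy Monte Carlo experiment. The sketch below is an illustrative setup assumed for this page (not the dissertation's own experiment): for several actions with known true means, the maximum estimator (max over per-action sample means, as in Q-learning targets) overestimates the true maximum expected value, while the double estimator (select the action with one half of the samples, evaluate it with the other half, as in double Q-learning) underestimates it.

```python
# Illustrative Monte Carlo sketch (assumed toy setup): bias of the maximum
# estimator vs. the double estimator for max_a E[X_a].
import numpy as np

rng = np.random.default_rng(0)
K, n, trials = 10, 10, 20000        # actions, samples per action, repetitions
true_means = np.zeros(K)
true_means[0] = 0.5                 # one genuinely better action
true_max = true_means.max()         # the quantity both estimators target

max_est, double_est = [], []
for _ in range(trials):
    # n noisy reward samples for each of the K actions.
    samples = rng.normal(true_means, 1.0, size=(n, K))
    # Maximum estimator: max over per-action sample means (overestimates).
    max_est.append(samples.mean(axis=0).max())
    # Double estimator: argmax on half A, evaluate that action on half B
    # (underestimates, because A sometimes picks a worse action).
    a_means = samples[: n // 2].mean(axis=0)
    b_means = samples[n // 2 :].mean(axis=0)
    double_est.append(b_means[a_means.argmax()])

print(f"true maximum      : {true_max:.3f}")
print(f"maximum estimator : {np.mean(max_est):.3f}  (biased upward)")
print(f"double estimator  : {np.mean(double_est):.3f}  (biased downward)")
```

Under this setup the maximum estimator's average lands above 0.5 and the double estimator's below it, which is the gap that the dissertation's combined estimators are designed to close.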
Keywords/Search Tags:reinforcement learning, deep reinforcement learning, value function, estimation of maximum expected value, estimation bias, exploration strategy