
Research On Value Function In Deep Reinforcement Learning

Posted on: 2022-06-09    Degree: Master    Type: Thesis
Country: China    Candidate: L Chen    Full Text: PDF
GTID: 2518306533472344    Subject: Control Science and Engineering
Abstract/Summary:
Most deep reinforcement learning algorithms use a deep neural network to approximate the value function, i.e., the expected return, and the agent selects actions or updates its policy according to the value estimates of different states. How to optimize the learning of the value function is therefore key to improving the performance of deep reinforcement learning algorithms. However, existing value-function update methods still suffer from long training times, slow convergence, and susceptibility to estimation bias. To address these problems, this thesis studies the following:

(1) In off-policy deep reinforcement learning algorithms the value function is generally learned iteratively by bootstrapping, and this iterative process can be viewed as a fixed-point iteration. When a deep neural network is used to approximate the value function in high-dimensional state spaces, such iterative learning can converge slowly. To accelerate the convergence of the value function and improve learning efficiency, this thesis introduces the Steffensen iteration method into value iteration, proposes Steffensen value iteration, and analyzes its convergence and convergence rate theoretically. On this basis, two off-policy deep reinforcement learning algorithms, SVI-DDQN and SVI-TD3, are proposed that use Steffensen value iteration to accelerate convergence.

(2) Most existing deep reinforcement learning algorithms use the one-step immediate reward to compute the return estimate that serves as the update target of the value function. Although multi-step returns can represent the return more accurately, they may fail to converge and can have large variance. This thesis therefore analyzes why multi-step deep reinforcement learning algorithms are difficult to converge, then uses adaptive coefficients to adjust the weights of multiple multi-step return estimates and proposes a Twin Delayed Deep Deterministic Policy Gradient algorithm based on multi-step adaptive correction. The algorithm treats the action selected by the noisy target policy network as the Q-value-maximizing action and sets the adaptive correction coefficient between two adjacent return estimators according to the distance between the action stored in the trajectory and the action with the largest Q-value. The corrected return estimate used as the update target of the value function is then obtained as the weighted sum of the multiple multi-step return estimators.

(3) Existing on-policy policy gradient algorithms mostly use Generalized Advantage Estimation to improve the policy. Although it balances bias and variance to some extent, Generalized Advantage Estimation is biased because it introduces estimates of multiple value functions. The Monte Carlo policy gradient, on the other hand, is unbiased but has large variance, which hinders convergence. This thesis therefore proposes a Mixed Advantage Estimation method that weights Generalized Advantage Estimation and the Monte Carlo return with a baseline so as to balance the bias and variance of the algorithm. To further reduce the bias introduced by using the value-function estimate as a baseline, the algorithm uses the mean discounted return of the different sequences at the same time step as the baseline of the Monte Carlo advantage estimate.
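As a concrete illustration of contribution (1), the sketch below applies a Steffensen (Aitken delta-squared) acceleration step elementwise to ordinary tabular value iteration. The tabular setting, the elementwise update, and the fallback to the plain Bellman update when the denominator vanishes are assumptions made for illustration; the thesis itself applies the idea to deep, off-policy algorithms (SVI-DDQN, SVI-TD3).

```python
import numpy as np

# Minimal tabular sketch of Steffensen-accelerated value iteration.
# Assumptions (not taken from the thesis text): the Aitken/Steffensen update
# is applied elementwise to the Bellman optimality operator, and states with
# a near-zero denominator fall back to the plain fixed-point step.

def bellman_optimality(V, P, R, gamma):
    """One application of the Bellman optimality operator T.

    P: transition tensor of shape (A, S, S); R: rewards of shape (A, S).
    Returns max_a [ R[a] + gamma * P[a] @ V ] for every state.
    """
    return np.max(R + gamma * (P @ V), axis=0)

def steffensen_value_iteration(P, R, gamma=0.99, iters=200, eps=1e-10):
    n_states = R.shape[1]
    V = np.zeros(n_states)
    for _ in range(iters):
        TV = bellman_optimality(V, P, R, gamma)    # T(V)
        TTV = bellman_optimality(TV, P, R, gamma)  # T(T(V))
        denom = TTV - 2.0 * TV + V
        safe = np.where(np.abs(denom) > eps, denom, 1.0)
        steff = V - (TV - V) ** 2 / safe           # Steffensen update
        # Fall back to the plain step V <- T(V) where the denominator vanishes.
        V = np.where(np.abs(denom) > eps, steff, TV)
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, S = 3, 5
    P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)  # row-stochastic
    R = rng.random((A, S))
    print(steffensen_value_iteration(P, R))
```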
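For contribution (2), the following sketch shows one plausible way to build a corrected multi-step target: several n-step return estimates are combined with weights derived from the distance between the stored action and the (noisy) target-policy action. The exponential form of the coefficient, the cumulative-product weighting, and all names (corrected_target, target_policy, target_q) are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

# Sketch of a multi-step return target with adaptive correction coefficients.
# Assumed coefficient: c_k = exp(-||a_k - pi'(s_k)||), shrinking as the stored
# action moves away from the target-policy action, which stands in for the
# Q-value-maximizing action in the abstract's description.

def corrected_target(rewards, states, actions, next_state,
                     target_policy, target_q, gamma=0.99):
    """Weighted sum of 1..N-step return estimates along one trajectory slice."""
    N = len(rewards)
    returns = []
    for n in range(1, N + 1):
        boot_state = states[n] if n < N else next_state
        g = sum(gamma ** k * rewards[k] for k in range(n))
        g += gamma ** n * target_q(boot_state, target_policy(boot_state))
        returns.append(g)
    # Adaptive weights: w_n is the product of per-step coefficients, so estimates
    # relying on actions far from the target policy are down-weighted.
    coeffs = [np.exp(-np.linalg.norm(actions[k] - target_policy(states[k])))
              for k in range(N)]
    weights = np.cumprod(coeffs)
    weights = weights / weights.sum()
    return float(np.dot(weights, returns))

if __name__ == "__main__":
    # Toy usage with stand-ins for the target policy and target critic.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(3, 2)); actions = rng.normal(size=(3, 2))
    rewards = [1.0, 0.5, 0.2]; next_state = rng.normal(size=2)
    pi = lambda s: np.tanh(s)               # stand-in target policy
    q = lambda s, a: float(np.dot(s, a))    # stand-in target critic
    print(corrected_target(rewards, states, actions, next_state, pi, q))
```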
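For contribution (3), the sketch below mixes Generalized Advantage Estimation with a Monte Carlo advantage whose baseline is the mean discounted return of the different sequences at the same time step, as described above. The mixing weight eta and the fixed-length, batched trajectories are simplifying assumptions.

```python
import numpy as np

# Sketch of the Mixed Advantage Estimation idea: a weighted combination of GAE
# and a Monte Carlo advantage whose baseline is the per-time-step mean
# discounted return across the batch of sequences.

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one episode; `values` carries one extra bootstrap entry V(s_T)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def discounted_returns(rewards, gamma=0.99):
    T = len(rewards)
    G = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def mixed_advantage(batch_rewards, batch_values, gamma=0.99, lam=0.95, eta=0.5):
    """batch_rewards: (N, T); batch_values: (N, T+1). Returns (N, T) advantages."""
    G = np.stack([discounted_returns(r, gamma) for r in batch_rewards])
    baseline = G.mean(axis=0)                   # mean return at each time step
    mc_adv = G - baseline                       # Monte Carlo advantage
    gae_adv = np.stack([gae(r, v, gamma, lam)
                        for r, v in zip(batch_rewards, batch_values)])
    return eta * gae_adv + (1.0 - eta) * mc_adv  # mixed advantage estimate
```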
Experiments on the MDPtoolbox, Atari 2600, MuJoCo robot control, and Box2D platforms show that the algorithms proposed in this thesis effectively improve the learning efficiency of the agent and obtain higher returns on multiple tasks. There are 20 figures, 16 tables, and 100 references in this thesis.
Keywords/Search Tags: deep reinforcement learning, value function, accelerating convergence, multi-step learning, advantage estimation