
Exploratory Action Correction Algorithm Based On Actor-Critic

Posted on: 2020-06-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Y B Jiang | Full Text: PDF
GTID: 2428330578479411 | Subject: Computer Science and Technology
Abstract/Summary:
Deep reinforcement learning extracts features from high-dimensional data through deep learning and, combined with reinforcement learning algorithms, can handle complex, large-scale continuous state-space tasks without preprocessing the input data. The actor-critic algorithm is one of the core algorithms of deep reinforcement learning. When the actor selects an action, exploration noise is usually added to keep the algorithm from converging to a local optimum. However, exploration actions cause the value function to be underestimated, which slows the convergence of the algorithm, and their randomness makes the algorithm insufficiently stable. This thesis proposes a series of actor-critic algorithms to address the inaccurate value function, poor convergence, and poor stability caused by exploration actions. The main research can be summarized in the following three aspects; an illustrative sketch of the key construction in each aspect is given after the abstract.

(1) In discrete-action tasks, the actor-critic algorithm underestimates the value function and converges unstably because of the maximum-entropy regularization term. Based on the proximal policy optimization method, a maximum entropy correction algorithm is proposed. The state-action value function is constructed from the state value function and the policy function already present in the network, and the maximum entropy correction term is obtained from the Bellman optimality equation. The advantages of the maximum entropy correction are analyzed theoretically, and the effectiveness of the algorithm is verified by corresponding experiments.

(2) In continuous-action tasks, the Gaussian policy used by the actor-critic algorithm greatly increases the probability of selecting boundary actions, which reduces the stability of the algorithm. Combining the importance sampling mechanism with the generalized advantage estimator, this thesis proposes an importance sampling advantage estimator and derives the corresponding update formula. The new estimator uses importance sampling to restrict the updates of boundary actions and improves the consistency between the value function and the policy, while also accelerating the convergence of the algorithm. Its effectiveness is verified through experiments on a multi-joint robot platform.

(3) In continuous control tasks, the importance sampling advantage estimator hinders the convergence of the policy toward boundary actions. Building on this estimator, the clipped action policy gradient is introduced, yielding a clipped action policy gradient algorithm combined with importance sampling. By modifying the gradient at boundary actions, the algorithm improves the convergence speed of the importance sampling advantage estimator on boundary actions and its performance on tasks whose optimum lies at a boundary action. The effectiveness of the algorithm is verified through comparison experiments.
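Contribution (1) relies on reconstructing a state-action value from the state value function and the policy that already exist in the network. The sketch below assumes the standard maximum-entropy (soft) relation between these quantities; the exact correction term of the thesis is not stated in the abstract, and the function name and temperature alpha are illustrative assumptions.

    def soft_q_from_v_and_pi(value, log_pi, alpha=0.2):
        """Reconstruct a state-action value from quantities already in the network.

        Under the soft (maximum-entropy) Bellman optimality equation the optimal
        policy satisfies pi(a|s) = exp((Q(s,a) - V(s)) / alpha), so
        Q(s,a) = V(s) + alpha * log pi(a|s). Works on floats, NumPy arrays, or
        torch tensors alike.
        """
        return value + alpha * log_pi

    # example: state value 1.3, log-probability of the chosen action -0.7
    q = soft_q_from_v_and_pi(value=1.3, log_pi=-0.7)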
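Contribution (2) combines importance sampling with the generalized advantage estimator. The abstract does not give the update formula, so the following is only a minimal sketch under the assumption that each temporal-difference term of the estimator is weighted by a clipped importance ratio between the current policy and the behaviour policy; the function name is_gae and the cap rho_max are hypothetical.

    import numpy as np

    def is_gae(rewards, values, log_pi, log_mu, gamma=0.99, lam=0.95, rho_max=1.0):
        """Generalized advantage estimation weighted by clipped importance ratios.

        rewards: r_0 .. r_{T-1}
        values:  V(s_0) .. V(s_T)            (length T + 1)
        log_pi / log_mu: log-probabilities of the executed actions under the
        current policy and the behaviour (sampling) policy, length T.
        """
        T = len(rewards)
        # Clipped ratios damp the updates of actions (e.g. boundary actions of a
        # Gaussian policy) to which the current policy assigns low probability.
        rho = np.minimum(np.exp(np.asarray(log_pi) - np.asarray(log_mu)), rho_max)
        adv = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = rho[t] * (delta + gamma * lam * gae)
            adv[t] = gae
        return adv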
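Contribution (3) introduces the clipped action policy gradient on top of the importance sampling advantage estimator. Below is a minimal sketch of the clipped action idea itself, assuming a Gaussian policy whose sampled actions the environment clips to [low, high]: for a boundary action the Gaussian log-density is replaced by the log-probability of the whole clipped tail, so the gradient no longer pushes the policy mean arbitrarily far past the boundary. How the thesis combines this with importance sampling is not specified in the abstract; the helper name and the 1e-8 stabiliser are illustrative.

    import torch
    from torch.distributions import Normal

    def clipped_action_log_prob(mean, std, action, low, high):
        """Log-probability used for a clipped action policy gradient."""
        dist = Normal(mean, std)
        # Probability mass of each clipped tail, via the Gaussian CDF.
        log_tail_low = torch.log(dist.cdf(torch.as_tensor(low)) + 1e-8)
        log_tail_high = torch.log(1.0 - dist.cdf(torch.as_tensor(high)) + 1e-8)
        log_density = dist.log_prob(action)
        # Boundary actions use the tail mass; interior actions keep the density.
        return torch.where(action <= low, log_tail_low,
                           torch.where(action >= high, log_tail_high, log_density))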
Keywords/Search Tags: deep reinforcement learning, actor-critic, policy gradient, trust region optimization