
Exploratory Action Correction Algorithm Based On Actor-Critic

Posted on: 2020-06-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Y B Jiang | Full Text: PDF
GTID: 2428330578479411 | Subject: Computer Science and Technology
Abstract/Summary:
Deep reinforcement learning extracts features from high-dimensional data through deep learning and, combined with reinforcement learning algorithms, can handle complex, large-scale continuous state-space tasks without preprocessing the input data. The actor-critic algorithm is one of the core algorithms of deep reinforcement learning. When the actor selects an action, exploration noise is usually added to keep the algorithm from converging to a local optimum. However, exploration actions cause the value function to be underestimated, which slows the convergence of the algorithm, and their randomness makes the algorithm insufficiently stable. This thesis proposes a series of actor-critic algorithms to address the inaccurate value function, poor convergence, and poor stability caused by exploration actions. The main research can be summarized in the following three aspects; an illustrative sketch of the key construction in each aspect is given after the abstract.

(1) In discrete-action tasks, the actor-critic algorithm underestimates the value function and converges unstably because of the maximum-entropy regularization term. Based on the proximal policy optimization method, a maximum entropy correction algorithm is proposed. The state-action value function is constructed from the state value function and the policy function already present in the network, and the maximum entropy correction term is obtained from the Bellman optimality equation. The advantages of the maximum entropy correction are analyzed theoretically, and the effectiveness of the algorithm is verified by corresponding experiments.

(2) In continuous-action tasks, the Gaussian policy used by the actor-critic algorithm greatly increases the probability of selecting boundary actions, which reduces the stability of the algorithm. Combining the importance sampling mechanism with the generalized advantage estimator, this thesis proposes an importance sampling advantage estimator and derives the corresponding update formula. The new estimator uses importance sampling to restrict the updates of boundary actions and improves the consistency between the value function and the policy, while also accelerating the convergence of the algorithm. Its effectiveness is verified through experiments on a multi-joint robot platform.

(3) In continuous control tasks, the importance sampling advantage estimator hinders the convergence of the policy toward boundary actions. Building on this estimator, the clipped action policy gradient is introduced, yielding a clipped action policy gradient algorithm combined with importance sampling. By modifying the gradient at boundary actions, the algorithm improves the convergence speed of the importance sampling advantage estimator on boundary actions and its performance on tasks whose optimum lies at a boundary action. The effectiveness of the algorithm is verified through comparison experiments.
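Contribution (1) relies on reconstructing a state-action value from the state value function and the policy that already exist in the network. The sketch below assumes the standard maximum-entropy (soft) relation between these quantities; the exact correction term of the thesis is not stated in the abstract, and the function name and temperature alpha are illustrative assumptions.

    def soft_q_from_v_and_pi(value, log_pi, alpha=0.2):
        """Reconstruct a state-action value from quantities already in the network.

        Under the soft (maximum-entropy) Bellman optimality equation the optimal
        policy satisfies pi(a|s) = exp((Q(s,a) - V(s)) / alpha), so
        Q(s,a) = V(s) + alpha * log pi(a|s). Works on floats, NumPy arrays, or
        torch tensors alike.
        """
        return value + alpha * log_pi

    # example: state value 1.3, log-probability of the chosen action -0.7
    q = soft_q_from_v_and_pi(value=1.3, log_pi=-0.7)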
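Contribution (2) combines importance sampling with the generalized advantage estimator. The abstract does not give the update formula, so the following is only a minimal sketch under the assumption that each temporal-difference term of the estimator is weighted by a clipped importance ratio between the current policy and the behaviour policy; the function name is_gae and the cap rho_max are hypothetical.

    import numpy as np

    def is_gae(rewards, values, log_pi, log_mu, gamma=0.99, lam=0.95, rho_max=1.0):
        """Generalized advantage estimation weighted by clipped importance ratios.

        rewards: r_0 .. r_{T-1}
        values:  V(s_0) .. V(s_T)            (length T + 1)
        log_pi / log_mu: log-probabilities of the executed actions under the
        current policy and the behaviour (sampling) policy, length T.
        """
        T = len(rewards)
        # Clipped ratios damp the updates of actions (e.g. boundary actions of a
        # Gaussian policy) to which the current policy assigns low probability.
        rho = np.minimum(np.exp(np.asarray(log_pi) - np.asarray(log_mu)), rho_max)
        adv = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = rho[t] * (delta + gamma * lam * gae)
            adv[t] = gae
        return adv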
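Contribution (3) introduces the clipped action policy gradient on top of the importance sampling advantage estimator. Below is a minimal sketch of the clipped action idea itself, assuming a Gaussian policy whose sampled actions the environment clips to [low, high]: for a boundary action the Gaussian log-density is replaced by the log-probability of the whole clipped tail, so the gradient no longer pushes the policy mean arbitrarily far past the boundary. How the thesis combines this with importance sampling is not specified in the abstract; the helper name and the 1e-8 stabiliser are illustrative.

    import torch
    from torch.distributions import Normal

    def clipped_action_log_prob(mean, std, action, low, high):
        """Log-probability used for a clipped action policy gradient."""
        dist = Normal(mean, std)
        # Probability mass of each clipped tail, via the Gaussian CDF.
        log_tail_low = torch.log(dist.cdf(torch.as_tensor(low)) + 1e-8)
        log_tail_high = torch.log(1.0 - dist.cdf(torch.as_tensor(high)) + 1e-8)
        log_density = dist.log_prob(action)
        # Boundary actions use the tail mass; interior actions keep the density.
        return torch.where(action <= low, log_tail_low,
                           torch.where(action >= high, log_tail_high, log_density))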
Keywords/Search Tags: deep reinforcement learning, actor-critic, policy gradient, trust region optimization