
Research On Twin Delayed Deep Deterministic Policy Gradient Based On Augmented Exploration

Posted on: 2023-11-10    Degree: Master    Type: Thesis
Country: China    Candidate: H B Zhang    Full Text: PDF
GTID: 2568306788966739    Subject: Control engineering
Abstract/Summary:
Twin delayed deep deterministic policy gradient (TD3) is a mainstream deep reinforcement learning algorithm: a model-free method that has been applied successfully to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. This thesis therefore studies how to strengthen the exploration ability of the TD3 algorithm. The main research contents are as follows:

(1) To address the inefficient exploration caused by building the objective from the lower bound of the twin Q-value functions, a TD3 variant based on optimistic exploration is proposed. First, starting from the twin Q-value functions, it is shown that taking their lower bound makes exploration pessimistic. Then a Gaussian function and a piecewise function are used to fit the twin Q-value functions. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. This exploration policy keeps the agent from settling on a sub-optimal policy and thus effectively mitigates the inefficient-exploration problem.

(2) To address TD3's inability to learn effectively in sparse-reward environments, a TD3 variant based on an internal exploration reward is proposed. First, the insufficient exploration ability of TD3 is analyzed. Second, the prediction error of the next state's latent features, obtained with a variational autoencoder and a prediction network, is used as a short-term exploration reward. Third, the error between the twin Q-value functions is computed as a long-term exploration reward. Finally, the internal exploration reward generated by the agent during training is the weighted combination of these two rewards. The internal exploration reward drives the agent to explore the environment effectively and further improves the learning efficiency of the model.

The proposed algorithms are compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify their effectiveness. The experimental results show that the proposed algorithms match or exceed the baseline reinforcement learning algorithms in reward, stability, and learning speed. There are 23 figures, 8 tables, and 90 references in this thesis.
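The optimistic exploration step of contribution (1) can be illustrated with a minimal sketch. The thesis's actual Gaussian/piecewise fit of the twin Q-values is not reproduced here; as a stand-in, the sketch forms an optimistic estimate (mean plus a scaled standard deviation of the two critics) and nudges the target policy's action up its gradient. The names pi, q1, q2 and the coefficients beta and step_size are hypothetical, not the thesis's notation.

```python
# Minimal sketch of optimistic exploration on top of TD3 (assumptions:
# pi is the deterministic target policy, q1/q2 are the twin critics,
# beta and step_size are hypothetical hyperparameters).
import torch

def optimistic_action(pi, q1, q2, state, beta=1.0, step_size=0.05,
                      max_action=1.0):
    """Shift the target policy's action toward a higher optimistic Q-value."""
    action = pi(state).detach().requires_grad_(True)
    qs = torch.stack([q1(state, action), q2(state, action)])  # (2, batch, 1)
    # Upper-confidence surrogate for the fitted Q-value: mean + beta * std.
    q_optimistic = (qs.mean(dim=0) + beta * qs.std(dim=0)).sum()
    q_optimistic.backward()  # populates action.grad = dQ_opt / da
    with torch.no_grad():
        explored = action + step_size * action.grad.sign()
    return explored.clamp(-max_action, max_action)
```

Taking a gradient step on an optimistic estimate, rather than sampling around the action that maximizes the pessimistic lower bound, is what lets the behavior policy visit regions the clipped double-Q target would otherwise avoid.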
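The internal exploration reward of contribution (2) can be sketched in the same spirit. How the prediction network is conditioned and how the two rewards are weighted are assumptions here (vae_encode, predictor, w_short, and w_long are hypothetical names); the sketch only mirrors the described structure: a short-term latent-prediction error plus a long-term twin-Q disagreement term.

```python
# Minimal sketch of the weighted internal exploration reward (assumptions:
# vae_encode returns the latent features of a state, predictor is the
# prediction network, q1/q2 are the twin critics, and w_short/w_long are
# hypothetical weighting hyperparameters).
import torch

def internal_reward(vae_encode, predictor, q1, q2,
                    state, action, next_state,
                    w_short=0.5, w_long=0.5):
    with torch.no_grad():
        # Short-term reward: error between predicted and actual latent
        # features of the next state, large for novel transitions.
        z_next = vae_encode(next_state)
        z_pred = predictor(state, action)
        r_short = (z_pred - z_next).pow(2).mean(dim=-1)
        # Long-term reward: disagreement between the twin Q-values,
        # large where the value estimate is still uncertain.
        r_long = (q1(state, action) - q2(state, action)).abs().squeeze(-1)
    return w_short * r_short + w_long * r_long
```

In training, this bonus would be added to the environment reward before the transition is stored in the replay buffer, so the critic targets themselves encourage revisiting poorly understood states.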
Keywords/Search Tags:deep reinforcement learning, exploration policy, exploration reward, variational autoencoder