
Research On Twin Delayed Deep Deterministic Policy Gradient Based On Augmented Exploration

Posted on: 2023-11-10    Degree: Master    Type: Thesis
Country: China    Candidate: H B Zhang    Full Text: PDF
GTID: 2568306788966739    Subject: Control engineering
Abstract/Summary:
Twin delayed deep deterministic policy gradient (TD3) is a mainstream deep reinforcement learning algorithm: a model-free method that has been applied successfully to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. This thesis therefore studies how to strengthen the exploration ability of the TD3 algorithm. The main research contents are as follows:

(1) To address the inefficient exploration caused by building the objective from the lower bound of the twin Q-value functions, a TD3 variant based on optimistic exploration is proposed. First, starting from the twin Q-value functions, it is shown that taking their lower bound makes exploration pessimistic. Then a Gaussian function and a piecewise function are used to fit the twin Q-value functions. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. This exploration policy keeps the agent from settling on a sub-optimal policy and thus effectively mitigates the inefficient-exploration problem.

(2) To address TD3's inability to learn effectively in sparse-reward environments, a TD3 variant based on an internal exploration reward is proposed. First, the insufficient exploration ability of TD3 is analyzed. Second, the prediction error of the next state's latent features, obtained with a variational autoencoder and a prediction network, is used as a short-term exploration reward. Third, the error between the twin Q-value functions is computed as a long-term exploration reward. Finally, the internal exploration reward generated by the agent during training is the weighted combination of these two rewards. The internal exploration reward drives the agent to explore the environment effectively and further improves the learning efficiency of the model.

The proposed algorithms are compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify their effectiveness. The experimental results show that the proposed algorithms match or exceed the baseline reinforcement learning algorithms in reward, stability, and learning speed. There are 23 figures, 8 tables, and 90 references in this thesis.
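The optimistic exploration step of contribution (1) can be illustrated with a minimal sketch. The thesis's actual Gaussian/piecewise fit of the twin Q-values is not reproduced here; as a stand-in, the sketch forms an optimistic estimate (mean plus a scaled standard deviation of the two critics) and nudges the target policy's action up its gradient. The names pi, q1, q2 and the coefficients beta and step_size are hypothetical, not the thesis's notation.

```python
# Minimal sketch of optimistic exploration on top of TD3 (assumptions:
# pi is the deterministic target policy, q1/q2 are the twin critics,
# beta and step_size are hypothetical hyperparameters).
import torch

def optimistic_action(pi, q1, q2, state, beta=1.0, step_size=0.05,
                      max_action=1.0):
    """Shift the target policy's action toward a higher optimistic Q-value."""
    action = pi(state).detach().requires_grad_(True)
    qs = torch.stack([q1(state, action), q2(state, action)])  # (2, batch, 1)
    # Upper-confidence surrogate for the fitted Q-value: mean + beta * std.
    q_optimistic = (qs.mean(dim=0) + beta * qs.std(dim=0)).sum()
    q_optimistic.backward()  # populates action.grad = dQ_opt / da
    with torch.no_grad():
        explored = action + step_size * action.grad.sign()
    return explored.clamp(-max_action, max_action)
```

Taking a gradient step on an optimistic estimate, rather than sampling around the action that maximizes the pessimistic lower bound, is what lets the behavior policy visit regions the clipped double-Q target would otherwise avoid.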
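The internal exploration reward of contribution (2) can be sketched in the same spirit. How the prediction network is conditioned and how the two rewards are weighted are assumptions here (vae_encode, predictor, w_short, and w_long are hypothetical names); the sketch only mirrors the described structure: a short-term latent-prediction error plus a long-term twin-Q disagreement term.

```python
# Minimal sketch of the weighted internal exploration reward (assumptions:
# vae_encode returns the latent features of a state, predictor is the
# prediction network, q1/q2 are the twin critics, and w_short/w_long are
# hypothetical weighting hyperparameters).
import torch

def internal_reward(vae_encode, predictor, q1, q2,
                    state, action, next_state,
                    w_short=0.5, w_long=0.5):
    with torch.no_grad():
        # Short-term reward: error between predicted and actual latent
        # features of the next state, large for novel transitions.
        z_next = vae_encode(next_state)
        z_pred = predictor(state, action)
        r_short = (z_pred - z_next).pow(2).mean(dim=-1)
        # Long-term reward: disagreement between the twin Q-values,
        # large where the value estimate is still uncertain.
        r_long = (q1(state, action) - q2(state, action)).abs().squeeze(-1)
    return w_short * r_short + w_long * r_long
```

In training, this bonus would be added to the environment reward before the transition is stored in the replay buffer, so the critic targets themselves encourage revisiting poorly understood states.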
Keywords/Search Tags:deep reinforcement learning, exploration policy, exploration reward, variational autoencoder