Deep Deterministic Policy Gradient Based On Entropy Regularization And Regular Update

Posted on: 2022-03-12 | Degree: Master | Type: Thesis | Country: China | Candidate: S Han | GTID: 2518306329461154 | Subject: Computer software and theory

Abstract/Summary:

The Deep Deterministic Policy Gradient (DDPG) algorithm is a widely used reinforcement learning algorithm. Although it can, to some extent, solve high-dimensional sequential decision problems in continuous action domains, its performance is unstable and it is inefficient on real-world problems. This instability and inefficiency stem in part from deficiencies in both the exploration of the environment and the utilisation of the collected data. Inadequate exploration often prevents the agent from discovering key information in the environment, which leads to locally optimal policies or even to learning failure. Inadequate data utilisation often prevents the agent from learning effective policies from the useful information found during exploration. This paper studies the DDPG algorithm around these exploration and data utilisation problems; the main outcomes are summarised below.

1. This paper presents an entropy regularization method for the DDPG algorithm. We parameterise the output layer of the policy network with a learnable noise layer and derive the learnable independent entropy and the joint entropy of the noise parameters, as well as an easily tractable upper bound on the joint entropy. In the early stage of learning, the method maximises the learnable independent entropy of the noise parameters alongside the objective function to enhance exploration; in the later stage, it indirectly minimises the joint entropy of the noise parameters by minimising its upper bound, which facilitates the formation of stable policies. We compare the method against baselines in a series of continuous-action experiments, and the results show that it outperforms previous algorithms. We also analyse the effect of the regularization factor on the performance of the algorithm and give parameter settings that perform well in the MuJoCo environments. A minimal sketch of a noisy output layer with an independent-entropy bonus is given below.
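
The abstract describes the noisy output layer only in prose; the following is a minimal PyTorch sketch of the idea, assuming a NoisyNet-style parameterisation. The names (NoisyOutputLayer, independent_entropy, actor_loss) and hyperparameters are illustrative, not the thesis' actual implementation, and the joint-entropy upper bound used in the later stage of training is not reproduced here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyOutputLayer(nn.Module):
    """Linear output layer whose weights carry learnable Gaussian noise scales."""

    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.1):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        nn.init.kaiming_uniform_(self.weight_mu, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fresh Gaussian perturbations of the weights on every forward pass,
        # so exploration happens in parameter space rather than action space.
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)

    def independent_entropy(self) -> torch.Tensor:
        # Entropy of independent Gaussians N(mu_i, sigma_i^2):
        #   H = sum_i 0.5 * log(2 * pi * e * sigma_i^2)
        sigmas = torch.cat([self.weight_sigma.flatten(), self.bias_sigma.flatten()])
        return (0.5 * torch.log(2 * math.pi * math.e * sigmas.pow(2))).sum()


def actor_loss(q_values: torch.Tensor, layer: NoisyOutputLayer, beta: float) -> torch.Tensor:
    # Maximise Q while adding the independent entropy as an exploration bonus
    # (beta > 0 early in training); minimising this loss maximises both terms.
    return -(q_values.mean() + beta * layer.independent_entropy())
```

In this reading, annealing beta, and switching to a penalty on the joint-entropy upper bound late in training, is what moves the policy from broad parameter-space exploration toward a stable, near-deterministic output.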

2. This paper proposes a regularly updated deterministic policy gradient (RUD) algorithm for the data utilisation problem. The method abandons the traditional paradigm of exploiting while exploring and instead learns from centrally sampled updates after a phase of focused exploration. This part of the paper first points out, theoretically, the underuse of new experiences and the overuse of old experiences under the original training paradigm, and then proves that the RUD learning procedure makes better use of new data in the replay buffer than the traditional procedure. In addition, the lower variance of the Q value in RUD is better suited to the current Clipped Double Q-learning strategy. We design a comparison experiment against previous methods, an ablation experiment against the original DDPG algorithm, and further analytical experiments in MuJoCo environments, and the results demonstrate the effectiveness and superiority of the RUD algorithm. A schematic sketch of this explore-then-update schedule is given after the summary below.

In summary, the entropy regularization method for the deep deterministic policy gradient algorithm improves the exploration capability of the algorithm by regularising the noise parameters while maintaining the stable output of the deterministic policy. The RUD algorithm enhances data utilisation by changing the learning process of the DDPG algorithm, and its schedule of centrally sampled updates after focused exploration is better suited to the Clipped Double Q-learning strategy. Both methods enable the DDPG algorithm to achieve higher performance in the tested environments.
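
The regular-update schedule is likewise only described in prose; here is a schematic Python sketch of the idea under illustrative assumptions. The env, agent, and buffer objects and their methods (act, step, add, sample, update) are hypothetical stand-ins rather than the thesis code, and block_size and batch_size are placeholder values.

```python
def train_rud(env, agent, buffer, total_steps: int, block_size: int = 1000, batch_size: int = 256):
    """Explore for a block of steps, then run the same number of updates in one burst."""
    state = env.reset()
    for step in range(1, total_steps + 1):
        # Focused exploration: only collect experience during the block,
        # instead of the usual one-gradient-update-per-environment-step loop.
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # Centrally sampled updates: after each exploration block, perform a
        # matching burst of updates, sampling minibatches from the refreshed buffer.
        if step % block_size == 0:
            for _ in range(block_size):
                agent.update(buffer.sample(batch_size))
```

Under Clipped Double Q-learning the critic target is y = r + gamma * min(Q1'(s', a'), Q2'(s', a')); the abstract's argument is that the lower Q-value variance produced by this block-update schedule suits that minimum-based target better than the interleaved schedule does.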

Keywords/Search Tags: Machine Learning, Reinforcement Learning, Deep Reinforcement Learning, Deep Deterministic Policy Gradient, Exploration of the Parameter Space, Data Utilisation, Entropy Regularization