
Research on Deterministic Policy Gradient Algorithms with Continuous Control Tasks

Posted on: 2022-03-22    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y Wang    Full Text: PDF
GTID: 2518306317489524    Subject: Computer technology
Abstract/Summary:
As one of the main families of algorithms in reinforcement learning, model-free deep reinforcement learning can learn autonomously by interacting with the environment, without building a model of that environment. Although it has made great progress on a series of challenging decision-making and control tasks, research on deep reinforcement learning for continuous control is still in its infancy, and several problems and challenges remain: dimension explosion, poor generalization in stochastic environments, inefficient use of sample data, fragile convergence, and a tendency to fall into locally optimal policies. These problems mean that most models require careful hyperparameter tuning, which seriously limits the applicability of deep reinforcement learning in complex real-world domains.

For Actor-Critic deep reinforcement learning in continuous action spaces, the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation problem of the Deep Deterministic Policy Gradient (DDPG) algorithm. However, it can introduce a large underestimation bias, and taking the lower bound of the two critic estimates can seriously inhibit exploration when that bound is far from the true Q-function; sample utilization also remains inefficient. The purpose of this thesis is to address these problems, improve existing deterministic policy gradient algorithms, and propose a new algorithm. The main contents and results are as follows.

Firstly, for the continuous control setting, this thesis proposes a method for smoothing the critic network. Because the action space of continuous control is infinite, only one action can be executed at each step, and the network gradient is updated only from that action and its reward, which easily produces deep, narrow spikes in the critic's value landscape. To address this, we argue that an ideal critic function should be smooth and continuous, and we propose a method that smooths the objective function of the critic network, which reduces the convergence difficulty of the actor network and improves sample efficiency. To verify the new method, we apply the improved DPG and DDPG algorithms in the open-source simple bipedal robot environment provided by OpenAI. The experimental results show that the smoothed algorithms converge faster.
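As a rough illustration of the critic-smoothing idea above, the sketch below averages the bootstrapped target over small Gaussian perturbations of the next action, which flattens deep, narrow spikes in the learned Q-function. This is a minimal PyTorch sketch under assumed details (noise scale, sample count, action bounds); it is not the thesis's exact smoothing objective.

```python
# Minimal sketch of a "smoothed" critic objective for DDPG-style learning.
# Assumed details: the smoothing is illustrated by averaging the bootstrapped
# target over small Gaussian perturbations of the next action; the noise scale,
# sample count, and [-1, 1] action bounds are illustrative assumptions.
import torch
import torch.nn.functional as F

def smoothed_critic_loss(critic, target_critic, target_actor,
                         state, action, reward, next_state, done,
                         gamma=0.99, noise_std=0.1, n_samples=8):
    with torch.no_grad():
        next_action = target_actor(next_state)                    # [B, act_dim]
        targets = []
        for _ in range(n_samples):
            noise = (torch.randn_like(next_action) * noise_std).clamp(-0.3, 0.3)
            perturbed = (next_action + noise).clamp(-1.0, 1.0)
            targets.append(target_critic(next_state, perturbed))  # [B, 1]
        # Averaging over nearby actions smooths the target Q landscape.
        target_q = reward + gamma * (1.0 - done) * torch.stack(targets).mean(dim=0)
    return F.mse_loss(critic(state, action), target_q)
```

In this form, the only change relative to the standard DDPG critic update is that the single target-action evaluation is replaced by an average over perturbed actions.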
Secondly, in view of the fact that the quality of the value function limits how good the learned policy can be, this thesis proposes three smoothing functions based on three different ideas. (1) Polarization: within the Actor-Critic framework, what matters is that the critic ranks the discounted cumulative values of actions in the same order as their true discounted cumulative values; widening the gap between action estimates makes the estimation error insufficient to change the ordering of the key actions. (2) Conservatism: raising the lower bound of the value estimate within a certain range of actions makes the model more robust. (3) Entropy: from the perspective of evaluation accuracy, actions with high estimated value have lower entropy and actions with low estimated value have higher entropy, so low-entropy actions are given higher confidence and higher weight. We apply the three smoothing methods to DDPG and TD3 and test them in two environments. The experimental results show that the three smoothing functions can raise the upper bound of the learned policy, but they also increase the probability of unstable convergence.

Thirdly, regarding model instability, we find that the robustness of the actor during training is positively correlated with the convergence ability of the model: the actor falling into a local optimum and the critic network fluctuating too quickly both harm learning. To address this, this thesis proposes a delayed-update algorithm based on a dual-actor, dual-critic architecture (a rough sketch of this update scheme follows below), which reduces the probability of the actor falling into a local optimum and slows the update rate of the critic network, so that the actor converges stably. To verify the effectiveness of the algorithm, we carry out extensive experiments on the difficult bipedal robot task in the open-source Gym environment provided by OpenAI. A comparison between single-actor and dual-actor variants shows that the dual actor makes convergence more robust and significantly reduces the model's "avalanche" phenomenon.

Finally, we combine these methods into a new algorithm, Soft-smooth Delay Double Deep Deterministic Policy Gradient (SD4PG). We conduct extensive experiments on challenging continuous control tasks, and the results show that SD4PG outperforms state-of-the-art methods.
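The following sketch illustrates one plausible reading of the dual-actor, dual-critic delayed update described above: two actors are maintained, the action of whichever actor the critic currently rates higher is followed, and actor updates occur only every few critic updates. The selection rule, the update frequency, and the function names are assumptions made for illustration; they are not the thesis's exact algorithm.

```python
# Hedged sketch of a dual-actor delayed update. Assumed reading: keep two
# independent actors, act with the one the critic currently values higher,
# and update the actors only every `policy_delay` critic steps.
import torch

def select_action(actors, critic, state):
    """Return the action of whichever actor the critic scores higher."""
    with torch.no_grad():
        actions = [actor(state) for actor in actors]
        values = [critic(state, a).mean() for a in actions]
    return actions[0] if values[0] >= values[1] else actions[1]

def maybe_update_actors(actors, actor_opts, critic, state, step, policy_delay=2):
    """Delayed policy improvement: actors update only every few critic steps."""
    if step % policy_delay != 0:
        return
    for actor, opt in zip(actors, actor_opts):
        # Deterministic policy gradient: ascend the critic's value of the
        # actor's own action. (A full implementation would also discard the
        # critic gradients accumulated by this step.)
        loss = -critic(state, actor(state)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```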
Keywords/Search Tags: Deep Reinforcement Learning, Continuous Control Tasks, Actor-Critic, Smoothing, Deterministic Policy Gradient