
Exploration Strategy of Deterministic Policy in Deep Reinforcement Learning

Posted on: 2022-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: S T Yang
Full Text: PDF
GTID: 2518306323966719
Subject: Cyberspace security
Abstract/Summary:
Deep Reinforcement Learning (DRL) has become one of the core research areas of Artificial Intelligence (AI) and is gradually being applied in many domains, such as games, advanced driver-assistance systems (ADAS), and robotics. However, the evaluation-based nature of reinforcement learning makes exploration essential: the agent must keep exploring to find better decisions, or at least to confirm that its current decision is the best available. The exploration-exploitation dilemma has therefore become a central challenge in RL. This thesis designs several exploration strategies for deterministic policy methods in DRL, taking mainstream deterministic policy gradient methods as baselines and improving their exploration efficiency and generalization ability.

To address the problem that traditional exploration strategies can only explore locally, we first propose an efficient exploration strategy that combines a stochastic policy with deep reinforcement learning. Applying this exploration method to the deep deterministic policy methods DDPG and TD3 yields two algorithms: stochastic guidance for deterministic policy gradient (SGDPG) and stochastic guidance for TD3 (SGTD3). We exploit the exploration ability of stochastic policies and use the experience they generate to train the deterministic policies, which encourages the deterministic policies to learn to explore without any hand-crafted heuristic.

To address the problem that the exploration noise of traditional strategies is not task-related, we then propose a more efficient exploration method based on posterior sampling and variational inference, again built on DDPG and TD3 and called embedding distribution for deterministic policy gradient (EDDPG) and embedding distribution for TD3 (EDTD3). The method uses variational inference to infer a distribution over a latent variable from the collected experience; a latent variable sampled from this distribution is then used as part of the RL agent's input. Combined with posterior sampling, this yields a practical training procedure that improves both the exploration ability and the sample efficiency of deterministic policies.

To fit the latent-variable distribution more accurately, we further propose an optimization method named Bootstrapped EDDPG on the basis of EDDPG. By applying the bootstrapping technique, Bootstrapped EDDPG uses multiple inference networks to fit the latent-variable distribution and aggregates the resulting probability distributions, yielding a more accurate probabilistic characterization of the latent variable.

To evaluate the proposed methods, we conducted extensive experiments in complex environments. The results verify the effectiveness of the first two methods in continuous control and sparse-reward environments, where their exploration efficiency and generalization ability exceed those of the baselines. EDDPG and EDTD3 outperform SGDPG and SGTD3, and the bootstrapped optimization framework performs better still.
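The stochastic-guidance idea described above can be pictured with a short sketch: a stochastic policy and a deterministic actor feed one shared replay buffer, and the deterministic actor is trained off-policy on all collected data. This is a minimal illustration, assuming a simple probabilistic gating rule; the names (StochasticPolicy, select_action, guide_prob) are illustrative and not the thesis's actual SGDPG/SGTD3 interfaces.

```python
# Sketch of "stochastic guidance": a stochastic policy generates exploratory
# experience that is also used to train a DDPG/TD3-style deterministic actor.
import random
import collections

import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """DDPG/TD3-style actor: state -> action (tanh-squashed)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class StochasticPolicy(nn.Module):
    """Gaussian policy used only to generate exploratory experience."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def sample(self, state):
        dist = torch.distributions.Normal(self.mu(state), self.log_std.exp())
        return torch.tanh(dist.rsample())

def select_action(state, actor, guide, guide_prob=0.3):
    """With probability guide_prob, act with the stochastic policy;
    otherwise act with the deterministic actor."""
    with torch.no_grad():
        if random.random() < guide_prob:
            return guide.sample(state)
        return actor(state)

# Every transition, regardless of which policy produced it, goes into one
# replay buffer; the deterministic actor and critic are then updated from it
# with the standard off-policy DDPG/TD3 losses.
replay_buffer = collections.deque(maxlen=1_000_000)
```

Because DDPG and TD3 are off-policy, the deterministic policy can be trained on the stochastic policy's transitions without changing its loss, which is what lets the guidance work without any additional exploration heuristic.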
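The latent-variable mechanism behind EDDPG/EDTD3 can be sketched similarly: an inference network maps experience to a Gaussian over a latent variable z, a sample of z is held fixed for an episode (posterior-sampling style) and concatenated to the state as extra actor input. The network shapes, the pooling over transitions, and the per-episode sampling rule are assumptions for illustration, not the thesis's exact design.

```python
# Sketch of the embedding-distribution idea: q(z | experience) is fit by
# variational inference, and a sampled z conditions the deterministic actor.
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """Amortized variational posterior q(z | experience)."""
    def __init__(self, transition_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(transition_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),
        )
        self.latent_dim = latent_dim

    def posterior(self, transitions):
        # transitions: (batch, transition_dim) features of (s, a, r, s') tuples.
        stats = self.encoder(transitions).mean(dim=0)  # pool over the batch
        mu, log_std = stats[: self.latent_dim], stats[self.latent_dim:]
        return torch.distributions.Normal(mu, log_std.exp())

class LatentConditionedActor(nn.Module):
    """Deterministic actor that takes [state, z] as input."""
    def __init__(self, state_dim, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

# Per-episode usage: sample z ~ q(z | recent experience), keep it fixed for
# the whole episode so the exploration noise is temporally consistent and
# task-related, and train q with an ELBO-style objective alongside the critic.
# A bootstrapped variant would maintain several inference networks and
# aggregate their posteriors, as Bootstrapped EDDPG does with multiple
# inference networks.
```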
Keywords/Search Tags:Deep reinforcement learning, Exploration-Exploitation dilemma, Deterministic policy, Stochastic policy, Posterior sampling, Variational inference