
Research On Exploration Enhancement Deep Reinforcement Learning Methods

Posted on: 2021-03-15    Degree: Master    Type: Thesis
Country: China    Candidate: T Y Li    Full Text: PDF
GTID: 2428330620978837    Subject: Control Science and Engineering
Abstract/Summary:
Deep reinforcement learning has been one of the most active research topics in artificial intelligence in recent years. An agent in reinforcement learning must continuously make decisions while interacting with its environment to complete a given task, so training typically requires a large amount of sample data. In environments where samples are scarce or difficult to obtain, it is therefore often hard to achieve satisfactory training results, which limits the further application of deep reinforcement learning to practical problems. Efficient exploration is an effective way to let the agent quickly obtain high-quality training samples. To this end, this thesis studies how to enhance the agent's exploration ability from two perspectives (parameter distribution representation and demonstration-assisted training) and proposes four exploration-enhanced deep reinforcement learning methods. The main work includes:

(1) To address the problem that parameterized representations can easily make algorithms unstable, an inference-based posterior parameter distribution optimization (IPPDO) method is proposed. On the one hand, drawing on the evidence lower bound from probabilistic inference and viewing the model in terms of observable and latent variables, a correspondence between the parameter distribution and the reinforcement learning objectives is established, and an objective function for optimizing the parameter distribution is constructed. On the other hand, an additional activation function is applied to the standard deviation of the parameter distribution to adjust how the distribution is mapped to network weights, which realizes adaptive switching between a fixed parameter value and a full parameter distribution and further improves the stability of the algorithm. In addition, IPPDO is an off-policy deep reinforcement learning method, so techniques such as experience replay can be used to improve sample utilization and accelerate learning.

(2) To address the problem that the parameter distribution is easily disturbed by the bias and variance of the policy gradient during optimization, which leads to instability and low learning efficiency, a proximal parameter distribution optimization (PPDO) method is proposed. Drawing on Reptile, the current network parameters are used to approximate the true parameters, and learning is accelerated by updating the parameter distribution twice: the first update is performed directly with a policy gradient (as in IPPDO), and the second update starts from the distribution obtained after the first. Furthermore, combining importance sampling with the idea of proximal parameter optimization, a KL-divergence penalty term limits the update range between consecutive parameter distributions, ensuring that the distribution keeps moving in the optimal direction during optimization.
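As a rough illustration of the parameter-distribution idea behind IPPDO and PPDO, the following PyTorch sketch models one weight tensor as a Gaussian parameter distribution whose standard deviation passes through an extra softplus activation, and penalizes the KL divergence between consecutive distributions. This is a minimal sketch under our own assumptions, not the thesis's implementation; the names `sample_weights` and `kl_penalty` are hypothetical.

```python
import torch
import torch.nn.functional as F

# Illustrative only: one weight tensor is modeled by a Gaussian parameter
# distribution (mu, rho). The standard deviation goes through an extra
# activation (softplus) so it stays positive and can shrink toward an almost
# deterministic weight, mimicking the adaptive switch between a fixed
# parameter value and a full distribution described for IPPDO.
mu = torch.zeros(64, 32, requires_grad=True)
rho = torch.full((64, 32), -3.0, requires_grad=True)

def sample_weights(mu, rho):
    sigma = F.softplus(rho)            # extra activation on the std. dev.
    eps = torch.randn_like(sigma)
    return mu + sigma * eps            # reparameterized weight sample

def kl_penalty(mu_new, rho_new, mu_old, rho_old):
    # Closed-form KL divergence between the updated and the previous Gaussian
    # parameter distributions, used as a proximal penalty (PPDO-style) so a
    # single update cannot move the distribution too far from the last one.
    s_new, s_old = F.softplus(rho_new), F.softplus(rho_old)
    return (torch.log(s_old / s_new)
            + (s_new ** 2 + (mu_new - mu_old) ** 2) / (2 * s_old ** 2)
            - 0.5).sum()
```

In a PPDO-like update this penalty would simply be added to the policy-gradient surrogate, e.g. `loss = policy_loss + beta * kl_penalty(mu, rho, mu_old, rho_old)`, where `mu_old` and `rho_old` are detached copies taken before the first of the two updates.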
(3) Aiming at the problem that existing demonstration-based exploration does not make full use of the demonstrations during training, the thesis proposes demonstration-based policy optimization (DPO), in which demonstrations improve the optimization of the network parameters in both the pre-training and the formal training stages. In pre-training, the demonstrations are used as labeled samples and an additional supervised-learning branch guides the network; pre-training objective functions based on demonstrations are constructed for continuous and discrete action spaces respectively. To improve the efficiency of sampling pre-training data, an attention mechanism over states and actions is proposed so that the agent focuses on high-quality samples during training. In the formal training stage, a regularizer based on the demonstrations is added to the original objective function, so that the demonstrations directly influence parameter optimization through the objective.

(4) Aiming at the problem that existing internal reward mechanisms based on distribution entropy cannot accurately guide exploration during training, a demonstration-based internal reward mechanism is proposed. First, by analyzing the relationship between demonstrations and internal rewards, a concrete expression for demonstration-based internal rewards is derived. Then, exploiting the strength of deep learning in function representation, a set of neural networks computes the internal rewards and is optimized with the idea of Generative Adversarial Networks. Finally, internal-reward calculation methods based on the Actor network and on the experience pool are proposed separately, and the value-function calculation is split into two independent parts, external rewards and internal rewards, so that the demonstration-based internal reward mechanism fits into the conventional deep reinforcement learning framework and improves the efficiency of parameter optimization (a rough sketch of this split follows the abstract).

OpenAI Gym and MuJoCo are used as experimental platforms to compare the proposed methods with current mainstream deep reinforcement learning algorithms on the corresponding tasks. Experimental results show that the proposed algorithms consistently obtain higher returns within the same training time and achieve better overall performance.
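As a rough illustration of method (4), the sketch below uses a small GAN-style discriminator to score (state, action) pairs by how demonstration-like they are, and adds that score as an internal reward when forming the bootstrapped value target. It is a minimal sketch under our own assumptions; the class and function names (`RewardDiscriminator`, `td_target`) and the additive combination are hypothetical, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

# Illustrative only: a GAN-style discriminator rates (state, action) pairs
# by how demonstration-like they are; its output is used as an internal
# reward, kept separate from the environment (external) reward and combined
# only when the bootstrapped value target is built.
class RewardDiscriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def td_target(disc, state, action, ext_reward, next_value,
              gamma=0.99, beta=0.1):
    # Value target split into an external part (environment reward) and an
    # internal part (demonstration-based reward from the discriminator).
    with torch.no_grad():
        int_reward = disc(state, action)
    return ext_reward + beta * int_reward + gamma * next_value
```

The discriminator itself would be trained adversarially, labeling demonstration pairs as 1 and pairs produced by the Actor or drawn from the experience pool as 0 (for example with `nn.BCELoss`), mirroring the Generative Adversarial Networks idea mentioned above.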
Keywords/Search Tags: deep reinforcement learning, exploration, parameter distribution, demonstration