
Optimization And Improvement Of Reinforcement Learning Based On Maximum Entropy Model

Posted on: 2021-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: K Jiang    Full Text: PDF
GTID: 2370330623468208    Subject: Mathematics
Abstract/Summary:
Generally, the goal of standard reinforcement learning is to find a single optimal solution, but in real-world environments it is far from sufficient to master only one way of solving a task. Mastering a variety of strategies is therefore a necessary skill for a world that is constantly changing. Throughout the development of reinforcement learning, however, exploration and exploitation have always been in tension: only sufficient exploration can discover new solutions, yet excessive exploration leaves the agent unable to master any of them. For a long time, exploration in model-free reinforcement learning has been carried out through seemingly random strategies, but this kind of heuristic exploration does not help the agent gather richer, more useful experience for independent learning. We therefore need an autonomous and effective way of exploring and exploiting, so that the agent can explore efficiently on its own and master multiple solutions from its exploration experience. To this end, a multi-modal policy distribution in energy-based form is introduced to improve reinforcement learning. An important existing result in this direction is learning the maximum entropy policy through soft Q-learning.

However, general reinforcement learning needs the environment's feedback reward, or a hand-designed reward function, to guide policy updates, while in real environments there is almost no feedback reward that can be used directly, and designing a reward function for every environment is inefficient. We therefore also need a way to explore and learn autonomously in environments without any feedback reward. There already exist reinforcement learning algorithms that require no reward, including the Goal Distance Gradient (GDG) algorithm, which uses the transition distance as the key quantity for updating the policy and can be applied in almost any environment. The Goal Distance Gradient method has not yet been combined with the maximum entropy model, so we take soft Q-learning as the theoretical basis and extend its maximum entropy model to the goal distance gradient algorithm. Based on the similarities and differences between the goal distance gradient method and the Deterministic Policy Gradient method, we apply the maximum entropy method to the goal distance gradient method and propose the GDG-Energy algorithm.

Experimental results show that the Goal Distance Gradient algorithm combined with the maximum entropy model can obtain a variety of solutions in the environment while retaining the characteristics and properties of the original algorithm. In experiments on four maze environments, the GDG-Energy algorithm achieves better results than DDPG-Energy in sparse reward settings. At the same time, for the local optimum problem, the GDG-Energy algorithm can reach the global optimum through extensive exploration.
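For context, the maximum entropy model referred to above can be summarized by the standard soft Q-learning formulation; the notation below is common background for this method, not taken verbatim from the thesis. The objective augments the expected return with a policy entropy term weighted by a temperature alpha, the soft Q-function satisfies a soft Bellman backup, and the optimal policy is the energy-based, multi-modal distribution induced by the soft Q-value:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]

Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}} \big[ V_{\mathrm{soft}}(s_{t+1}) \big]

V_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \Big)\, da'

\pi(a_t \mid s_t) \propto \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a_t) \Big)

Because the policy is proportional to an exponentiated Q-value rather than a single deterministic maximizer, it can place probability mass on several distinct high-value actions, which is what allows the agent to retain multiple solutions; the thesis extends this energy-based policy form from soft Q-learning to the goal distance gradient setting.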
Keywords/Search Tags:Reinforcement learning, Maximum entropy model, Goal distance gradient, Deterministic policy gradient, Soft Q-learning