
Optimization And Improvement Of Reinforcement Learning Based On Maximum Entropy Model

Posted on: 2021-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: K Jiang    Full Text: PDF
GTID: 2370330623468208    Subject: Mathematics
Abstract/Summary:
Generally, the goal of standard reinforcement learning is to find a single optimal solution, but in real-world environments it is far from sufficient to master only one way of solving a task. Mastering a variety of strategies is therefore a necessary skill for a world that is constantly changing. Throughout the development of reinforcement learning, however, exploration and exploitation have always been in tension: only sufficient exploration can discover new solutions, yet excessive exploration leaves the agent unable to master any of them. For a long time, exploration in model-free reinforcement learning has been carried out through seemingly random strategies, but this kind of heuristic exploration does not help the agent gather richer, more useful experience for independent learning. We therefore need an autonomous and effective way of exploring and exploiting, so that the agent can explore efficiently on its own and master multiple solutions from its exploration experience. To this end, a multi-modal policy distribution in energy-based form is introduced to improve reinforcement learning. An important existing result in this direction is learning the maximum entropy policy through soft Q-learning.

However, general reinforcement learning needs the environment's feedback reward, or a hand-designed reward function, to guide policy updates, while in real environments there is almost no feedback reward that can be used directly, and designing a reward function for every environment is inefficient. We therefore also need a way to explore and learn autonomously in environments without any feedback reward. There already exist reinforcement learning algorithms that require no reward, including the Goal Distance Gradient (GDG) algorithm, which uses the transition distance as the key quantity for updating the policy and can be applied in almost any environment. The Goal Distance Gradient method has not yet been combined with the maximum entropy model, so we take soft Q-learning as the theoretical basis and extend its maximum entropy model to the goal distance gradient algorithm. Based on the similarities and differences between the goal distance gradient method and the Deterministic Policy Gradient method, we apply the maximum entropy method to the goal distance gradient method and propose the GDG-Energy algorithm.

Experimental results show that the Goal Distance Gradient algorithm combined with the maximum entropy model can obtain a variety of solutions in the environment while retaining the characteristics and properties of the original algorithm. In experiments on four maze environments, the GDG-Energy algorithm achieves better results than DDPG-Energy in sparse reward settings. At the same time, for the local optimum problem, the GDG-Energy algorithm can reach the global optimum through extensive exploration.
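For context, the maximum entropy model referred to above can be summarized by the standard soft Q-learning formulation; the notation below is common background for this method, not taken verbatim from the thesis. The objective augments the expected return with a policy entropy term weighted by a temperature alpha, the soft Q-function satisfies a soft Bellman backup, and the optimal policy is the energy-based, multi-modal distribution induced by the soft Q-value:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]

Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}} \big[ V_{\mathrm{soft}}(s_{t+1}) \big]

V_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \Big)\, da'

\pi(a_t \mid s_t) \propto \exp\!\Big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a_t) \Big)

Because the policy is proportional to an exponentiated Q-value rather than a single deterministic maximizer, it can place probability mass on several distinct high-value actions, which is what allows the agent to retain multiple solutions; the thesis extends this energy-based policy form from soft Q-learning to the goal distance gradient setting.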
Keywords/Search Tags:Reinforcement learning, Maximum entropy model, Goal distance gradient, Deterministic policy gradient, Soft Q-learning