
Deep Reinforcement Learning Based On Value Distribution And Diversity Policy

Posted on: 2022-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: L M Feng
Full Text: PDF
GTID: 2518306533972359
Subject: Control Science and Engineering
Abstract/Summary:
Artificial intelligence is a very broad field. As a subfield of artificial intelligence, deep reinforcement learning has gradually become a hot research topic. A deep reinforcement learning agent adjusts its policy according to feedback received from the environment; that is, the agent interacts with the environment extensively and learns the policy needed to complete a specific task. However, the large number of interactions required, whether in simulation or in the real world, inevitably wastes computational resources. How to reduce the number of interactions and improve training efficiency is therefore a core problem of deep reinforcement learning, and enhancing the agent's exploration ability is the key to improving the learning efficiency of deep reinforcement learning algorithms. This thesis improves the agent's exploration ability from the two perspectives of value distribution and policy diversity, through the following work.

(1) To address the problem that conservative value estimation reduces the agent's exploration ability, an optimistic deep reinforcement learning method based on value distribution is proposed. By combining quantile regression with the Actor-Critic framework, the output of the Critic is changed from a mean value to a distribution. When learning the distribution of state-action values, the target quantile values are computed by taking, at each quantile fraction, the minimum of the outputs of two Critic networks, which alleviates the overestimation problem (see the first sketch below). During exploration, the high quantiles of the state-action value distribution guide an optimistic exploration policy, while the Kullback-Leibler divergence between the optimistic exploration policy and the target policy is bounded to ensure the stability of the algorithm.

(2) To address the problem that the Gaussian policy is relatively simple and limits the agent's exploration ability, a deep reinforcement learning method based on value distribution and a normalizing-flows policy is proposed. First, the Gaussian policy output by the Actor in conventional deep reinforcement learning algorithms is transformed, using normalizing flows, into a more expressive normalizing-flows policy, enhancing the agent's exploration ability by increasing the expressiveness of the policy distribution (see the second sketch below). Second, the Critic network is reconstructed with an implicit quantile network, which enhances the stability of training by learning the distribution of state-action values.

(3) To address the problem that vanishing gradients during policy updates may trap the algorithm in a local optimum and thereby weaken the agent's exploration ability, a diversity evolution policy deep reinforcement learning method is proposed. The cross-entropy method from estimation-of-distribution algorithms, the maximum mean discrepancy (MMD), and deep reinforcement learning are combined. With MMD as a measure of the distance between policies, some policies in the population maximize their distance from the previous generation of policies while maximizing cumulative reward during gradient updates. Furthermore, combining cumulative return and inter-policy distance into the fitness of the population encourages more diverse offspring policies, which in turn reduces the risk of falling into a local optimum due to vanishing gradients and improves learning efficiency (see the third sketch below).
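The per-quantile minimum target and the quantile regression loss of contribution (1) can be illustrated as follows. This is a minimal sketch, not the thesis implementation: the number of quantiles, the network sizes, and the helper names (QuantileCritic, target_quantiles, quantile_huber_loss) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

N_QUANT = 32                                      # number of quantile fractions (assumed)
taus = (torch.arange(N_QUANT) + 0.5) / N_QUANT    # midpoint quantile fractions

class QuantileCritic(nn.Module):
    """Maps (state, action) to N_QUANT quantiles of the return distribution."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, N_QUANT),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))   # (batch, N_QUANT)

def target_quantiles(critic1, critic2, next_s, next_a, reward, done, gamma=0.99):
    """Per-quantile minimum over the two critics: the conservative target
    that alleviates overestimation while keeping the full distribution."""
    with torch.no_grad():
        z = torch.min(critic1(next_s, next_a), critic2(next_s, next_a))
        return reward.unsqueeze(-1) + gamma * (1.0 - done).unsqueeze(-1) * z

def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile regression (Huber) loss between predicted and target quantiles."""
    u = target.unsqueeze(1) - pred.unsqueeze(2)   # pairwise TD errors (batch, N, N)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u ** 2,
                        kappa * (u.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (u.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```

Taking the minimum at each quantile fraction, rather than of the means, keeps the conservative twin-critic correction while preserving the shape of the return distribution; an optimistic exploration actor can then be guided by the upper quantiles of this distribution, subject to the KL bound described above.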
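The normalizing-flows policy of contribution (2) can be sketched with RealNVP-style affine couplings conditioned on the state. The flow type, the number of coupling layers, and the layer sizes are assumptions made for this sketch, and the tanh action squashing common in continuous control is omitted for brevity.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """RealNVP-style affine coupling, conditioned on the state: rescales and
    shifts half of the action dimensions as a function of the other half."""
    def __init__(self, action_dim, state_dim, hidden=64):
        super().__init__()
        self.d = action_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (action_dim - self.d)),
        )

    def forward(self, x, state):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        scale, shift = self.net(torch.cat([x1, state], dim=-1)).chunk(2, dim=-1)
        scale = torch.tanh(scale)                 # bound the scale for stability
        y2 = x2 * torch.exp(scale) + shift
        return torch.cat([x1, y2], dim=-1), scale.sum(-1)   # output, log|det J|

class FlowPolicy(nn.Module):
    """Base Gaussian noise pushed through invertible couplings: a policy
    distribution more expressive than a single Gaussian."""
    def __init__(self, state_dim, action_dim, n_flows=2):
        super().__init__()
        self.action_dim = action_dim
        self.flows = nn.ModuleList(
            Coupling(action_dim, state_dim) for _ in range(n_flows))

    def sample(self, state):
        z = torch.randn(state.shape[0], self.action_dim)
        log_prob = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        a = z
        for flow in self.flows:
            a, logdet = flow(a, state)
            log_prob = log_prob - logdet          # change-of-variables correction
            a = a.flip(-1)                        # permute so both halves get transformed
        return a, log_prob
```

Because each coupling is invertible with a triangular Jacobian, the exact log-probability of a sampled action remains available through the change-of-variables formula, so the policy can still be trained with likelihood-based objectives while representing multimodal action distributions that a single Gaussian cannot.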
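The diversity term of contribution (3) can be illustrated by a squared MMD between actions sampled from an offspring policy and its parent on the same batch of states, combined with the return into a fitness that a cross-entropy-method update then exploits. The RBF kernel bandwidth, the weighting beta, and the elite fraction are assumptions for the sketch.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two sets of action vectors."""
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def mmd2(a, b, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between
    actions sampled from two policies on the same batch of states."""
    return (rbf_kernel(a, a, sigma).mean()
            + rbf_kernel(b, b, sigma).mean()
            - 2.0 * rbf_kernel(a, b, sigma).mean())

def fitness(cum_return, child_actions, parent_actions, beta=0.1):
    """Cumulative return plus a diversity bonus: offspring whose action
    distribution is far (in MMD) from the previous generation score higher."""
    return cum_return + beta * mmd2(child_actions, parent_actions)

def cem_update(population, scores, elite_frac=0.2):
    """Cross-entropy method step: refit the Gaussian sampling distribution
    over policy parameters to the elite individuals.
    population: (pop_size, n_params) tensor; scores: fitness per individual."""
    k = max(1, int(elite_frac * len(scores)))
    elite_idx = torch.topk(torch.as_tensor(scores), k).indices
    elites = population[elite_idx]
    return elites.mean(dim=0), elites.std(dim=0, unbiased=False) + 1e-3
```

A larger MMD means the offspring behaves differently from the previous generation on the same states, so the fitness rewards both high return and behavioral novelty, and the CEM refit keeps the sampling distribution near the best-scoring, most diverse individuals.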
In the MuJoCo test environments, compared with current mainstream deep reinforcement learning algorithms, the algorithms proposed in this thesis obtain higher returns with fewer interactions with the environment and thus have higher learning efficiency. The thesis contains 25 figures, 9 tables, and 87 references.
Keywords/Search Tags: deep reinforcement learning, value distribution, diversity policy, normalizing flows, exploration