
Towards Sample-efficient Deep Reinforcement Learning

Posted on: 2022-08-13    Degree: Doctor    Type: Dissertation
Country: China    Candidate: G Q Liu    Full Text: PDF
GTID: 1488306323482454    Subject: Cyberspace security
Abstract/Summary:
In recent years, cyberspace has become as closely connected with people's lives as the physical world, so maintaining the security of cyberspace has become an important topic. Reinforcement learning (RL), one of the most important paradigms of machine learning, is widely used to describe and solve problems in which an agent learns policies that maximize returns or achieve specific goals by interacting with an environment. Thanks to the rapid development of deep neural networks, deep reinforcement learning (DRL) has made dramatic breakthroughs in many fields, such as cyberspace security, video games, robot control, and autonomous driving. Despite these successes, the sample efficiency of current DRL algorithms is still low. For example, on Atari games, advanced DRL algorithms such as MuZero and Agent57 need approximately 10 to 50 years of gameplay experience to reach their remarkable performance. This seriously hinders the real-world application of DRL, especially when environmental interaction is expensive, as in cyberspace security, healthcare, and autonomous driving. Although research on improving the sample efficiency of DRL has made some progress, four key issues still require active exploration:

1) Reusing data samples: how to effectively use sampled data to update the policy is a critical point in reinforcement learning. Some RL algorithms (e.g., evolution strategies) use each data sample for only a single gradient update, which is inefficient.

2) Learning state representations: real-world decision-making problems often have high-dimensional continuous state spaces, and these state spaces may be redundant. An effective and compact state representation can greatly reduce the size of the original state space and thereby improve the sample efficiency of RL algorithms.

3) Leveraging expert demonstrations: in many real scenarios, the agent has access not only to reward signals from the environment but also to expert demonstrations. We are therefore concerned with how to use these demonstrations effectively to improve the sample efficiency of current RL algorithms.

4) Utilizing perfect information: imperfect-information games such as Mahjong often contain rich hidden information, such as the other players' hands and the wall tiles. Without access to such information, RL training can be very slow, so we investigate how to use the guidance of perfect information to improve sample efficiency during training.

This dissertation focuses on these four issues and proposes four corresponding algorithms:

1) Trust Region Evolution Strategies. The core idea of this work is to reuse current data samples for multiple gradient updates, thereby greatly improving the sample efficiency of the evolution strategies algorithm. In the traditional evolution strategies algorithm, each data sample is used for one gradient update and then discarded. This work proposes a new surrogate objective function that allows the current data to be reused for multiple policy updates. Moreover, we prove that optimizing the surrogate function also optimizes the objective of the original evolution strategies algorithm, which guarantees the correctness of the surrogate optimization. Experiments on five tasks from the widely used simulated robot platform MuJoCo show that the new algorithm consistently and significantly improves the sample efficiency of the original evolution strategies algorithm on all tasks (a minimal code sketch of the sample-reuse idea appears after this overview).
2) Return-based Contrastive Representation Learning for Reinforcement Learning. The core idea of this work is to use the return distribution, the most important feedback signal in reinforcement learning, to design a new state representation learning algorithm, thereby greatly improving the sample efficiency of the base RL algorithm. Previous works on representation learning for RL are largely unsupervised and do not exploit the characteristics of RL problems. In contrast, this work directly uses the return distribution as the learning signal for the state representation. Such a representation guarantees that the value function of the original Markov decision process can be accurately represented in the abstract Markov decision process, while greatly reducing the size of the original state-action space and thus improving sample efficiency. We conducted experiments on two types of tasks, Atari games and the DMControl Suite, and the results show that our representation learning algorithm greatly improves the sample efficiency of the base RL algorithm (see the second sketch below).

3) Demonstration Actor Critic. The core idea of this work is to design a new reward-reshaping algorithm that makes better use of expert demonstrations to accelerate reinforcement learning. Several existing algorithms use expert demonstrations to speed up RL training, but they treat all states in the state space equally when updating the policy, ignoring the fact that the states covered by the expert demonstrations come with direct action supervision signals. We therefore propose a new reward-reshaping objective function: by optimizing it, the policy is updated according to the current state, and for states that appear in the expert demonstrations the existing supervision signals are used directly during training. Experiments on five difficult sparse-reward tasks on the simulated robot platform MuJoCo show that, with only a small amount of expert demonstration data, the new algorithm greatly improves the sample efficiency of the original reinforcement learning algorithm (see the third sketch below).

4) Oracle Guiding. The core idea of this work is to use the guidance of perfect information to improve the sample efficiency of training a deep reinforcement learning agent in imperfect-information games. We first introduce an oracle agent that can see all information: the normal observations as well as the additional perfect information. With this (unfair) access to perfect information, the oracle agent easily becomes a strong Mahjong player after RL training. To let the oracle agent guide and accelerate the learning of a normal agent, we gradually drop the perfect features by adding masks, so that the model transitions from the oracle agent to a normal agent. Experimental results based on an extremely large number of offline Mahjong games show that, with the oracle guiding algorithm, the sample efficiency of Suphx is greatly improved (see the fourth sketch below).
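To make the sample-reuse idea of contribution 1 concrete, here is a minimal numpy sketch, not the dissertation's exact algorithm: one batch of Gaussian perturbations is evaluated once and then reused for several importance-weighted updates of the search-distribution mean, with a simple weight clip standing in for the trust-region constraint. The toy fitness function and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def fitness(theta):
    """Toy stand-in for the episodic return of policy parameters theta."""
    return -np.sum((theta - 1.0) ** 2)

dim, n_samples, sigma = 10, 50, 0.1
lr, n_reuse, clip_eps = 0.05, 10, 0.2

mu = np.zeros(dim)  # mean of the Gaussian search distribution
for iteration in range(100):
    mu_old = mu.copy()
    # Sample perturbations ONCE under the old distribution N(mu_old, sigma^2 I).
    eps = np.random.randn(n_samples, dim)
    thetas = mu_old + sigma * eps
    returns = np.array([fitness(t) for t in thetas])
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Reuse the same evaluated samples for several surrogate-gradient updates.
    for _ in range(n_reuse):
        # Importance weights between the current and old search distributions.
        logp_new = -np.sum((thetas - mu) ** 2, axis=1) / (2 * sigma ** 2)
        logp_old = -np.sum((thetas - mu_old) ** 2, axis=1) / (2 * sigma ** 2)
        w = np.clip(np.exp(logp_new - logp_old), 1 - clip_eps, 1 + clip_eps)
        # Weighted score-function gradient of the Gaussian mean (ascent step).
        grad = np.mean((w * adv)[:, None] * (thetas - mu) / sigma ** 2, axis=0)
        mu = mu + lr * grad
```

Without the importance weights and the clip, only the first inner update would be justified; with them, one batch of environment interactions supports many updates, which is the source of the sample-efficiency gain.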
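The second sketch illustrates the return-based contrastive idea of contribution 2 under simplifying assumptions: states whose returns fall into the same bin are treated as positive pairs in an InfoNCE-style loss. The linear encoder, the quantile binning, and the exact loss form are assumptions for illustration, not the dissertation's construction.

```python
import numpy as np

def encode(states, W):
    """Hypothetical linear encoder mapping raw states to unit-norm representations."""
    z = states @ W
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)

def return_contrastive_loss(states, returns, W, n_bins=5, temperature=0.1):
    """Two states form a positive pair iff their returns fall into the same
    quantile bin -- a stand-in for using the return signal to shape the
    state representation."""
    z = encode(states, W)
    sim = z @ z.T / temperature
    edges = np.quantile(returns, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(returns, edges)
    not_self = ~np.eye(len(states), dtype=bool)
    pos = (bins[:, None] == bins[None, :]) & not_self

    loss, count = 0.0, 0
    for i in range(len(states)):
        if not pos[i].any():
            continue
        log_denom = np.log(np.exp(sim[i][not_self[i]]).sum())
        loss += -(sim[i][pos[i]] - log_denom).mean()
        count += 1
    return loss / max(count, 1)

# Toy usage with random data.
rng = np.random.default_rng(0)
states, returns = rng.normal(size=(64, 8)), rng.normal(size=64)
print(return_contrastive_loss(states, returns, rng.normal(size=(8, 4))))
```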
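The third sketch gives a hypothetical actor objective in the spirit of contribution 3: states outside the demonstrations use the ordinary advantage-weighted policy-gradient term, while states that appear in the demonstrations receive an additional direct supervision (behavior-cloning) term on the demonstrated action. The diagonal-Gaussian policy and the bc_weight parameter are illustrative assumptions.

```python
import numpy as np

def gaussian_logprob(actions, mean, log_std):
    """Log-density of a diagonal-Gaussian policy."""
    var = np.exp(2 * log_std)
    return -0.5 * np.sum((actions - mean) ** 2 / var + 2 * log_std + np.log(2 * np.pi), axis=-1)

def actor_loss(policy_mean, log_std, actions, advantages, in_demo, demo_actions, bc_weight=1.0):
    """Per-state actor objective: standard actor-critic term everywhere, plus a
    demonstration-supervision term only on states covered by the expert data.
    The weighting scheme is an illustrative assumption."""
    pg = -advantages * gaussian_logprob(actions, policy_mean, log_std)  # actor-critic term
    bc = -gaussian_logprob(demo_actions, policy_mean, log_std)          # demo-action supervision
    return np.where(in_demo, pg + bc_weight * bc, pg).mean()

# Toy usage with random tensors standing in for a minibatch.
rng = np.random.default_rng(1)
B, A = 32, 4
loss = actor_loss(
    policy_mean=rng.normal(size=(B, A)), log_std=np.zeros(A),
    actions=rng.normal(size=(B, A)), advantages=rng.normal(size=B),
    in_demo=rng.random(B) < 0.25, demo_actions=rng.normal(size=(B, A)),
)
print(loss)
```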
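Finally, the fourth sketch shows the oracle-guiding idea of contribution 4, assuming a simple linear schedule: the perfect-information features are dropped with a probability that grows over training, so an agent that starts with oracle access gradually becomes a normal agent relying only on imperfect observations.

```python
import numpy as np

def mask_perfect_features(normal_feats, perfect_feats, progress, rng):
    """Early in training (progress ~ 0) the agent sees most of the perfect
    information; late in training (progress ~ 1) the perfect features are
    almost always dropped. The linear drop schedule is an illustrative assumption."""
    drop_prob = progress
    keep = (rng.random(perfect_feats.shape) >= drop_prob).astype(perfect_feats.dtype)
    return np.concatenate([normal_feats, perfect_feats * keep], axis=-1)

# Toy usage: over training, perfect features fade out of the model input.
rng = np.random.default_rng(2)
for step, total in [(0, 100), (50, 100), (100, 100)]:
    x = mask_perfect_features(
        normal_feats=rng.normal(size=(1, 16)),
        perfect_feats=rng.normal(size=(1, 8)),
        progress=step / total, rng=rng,
    )
    print(step, np.count_nonzero(x[0, 16:]), "perfect features kept")
```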
Keywords/Search Tags:Reinforcement Learning, Deep Neural Network, Sample Efficiency, Reusing Data Sample, Learning State Representation, Leveraging Expert Demonstration, Utilizing Perfect Information