
Research On Sample-efficient Deep Reinforcement Learning Methods

Posted on: 2024-03-11 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y N Zhao | Full Text: PDF
GTID: 1528307376982379 | Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning (RL) aims to solve sequential decision-making problems. The core idea is to formulate the task as a Markov decision process in which the agent continually updates its policy through trial and error to maximize a reward signal. The whole process requires no prior knowledge or manual intervention: given an appropriate reward signal, the task can be solved. These properties have made reinforcement learning one of the frontier research areas of artificial intelligence. Combined with deep learning, deep reinforcement learning (DRL) methods have achieved important results in video games, intelligent control, recommendation systems, and other fields, and have broad application value.

Although DRL performs well on some complex tasks, it still faces many challenges, one of which is the low sample efficiency of its algorithms. To obtain a policy with a given level of performance, the agent must collect a large number of samples. For example, an agent may need 6 to 9 days of gameplay, about 20 million interactions in total, to reach the level a human player attains after roughly 20 minutes. Large numbers of real-world interactions also raise costs, which greatly limits further applications of DRL. Improving the quality of the learned policy and the speed of training is an important route to better sample efficiency: it lets the agent finish tasks sooner and avoid mistakes, ultimately reducing the number of interactions required. This thesis improves policy quality from three aspects, namely robustness, planning ability, and exploration ability, and additionally proposes a method to accelerate the training process, yielding the following four research contributions.

(1) We propose a distributional reinforcement learning method based on adaptive bounds to address the sample inefficiency caused by a lack of robustness in the learned policy. When facing environments
with randomness or uncertainty, the agent is easily affected by task-irrelevant factors, leading to wrong decisions and sample-inefficient algorithms. The proposed method approximates the expectation of the value function while also fitting its full distribution by reusing samples, which improves both the robustness of the policy and the sample efficiency of the algorithm. Furthermore, we obtain confidence intervals for the upper and lower bounds of the learned value distribution via bootstrapping, and use these intervals to update the bounds of the value distribution, yielding more accurate value estimates. Finally, a target policy smoothing strategy is proposed to stabilize training. Experimental results show that the proposed method protects the agent from irrelevant factors and yields stable, sample-efficient RL.

(2) We propose a model-based reinforcement learning method based on diverse predicted trajectories to address the sample inefficiency caused by the agent's lack of planning ability. In environments with long horizons and irreversible states (such as Go, where a move cannot be taken back), an agent that lacks planning ability and decides blindly makes the task harder to finish, or even impossible, leading to sample-inefficient algorithms. The method exploits the dynamics information contained in the samples to fit an environment model through supervised learning, so that samples are reused efficiently. Imagined trajectories, generated with the environment model and an exploratory rollout policy, are then fed into the policy network as additional information. In this way the agent is endowed with planning and prediction abilities, which accelerates training. Experiments demonstrate that the proposed method achieves higher sample efficiency by fitting the environment model and generating
imagination trajectories.

(3) We propose a hierarchical reinforcement learning method based on state coverage to address the sample inefficiency caused by the agent's lack of exploration ability in complex environments. In environments with sparse or deceptive rewards, the agent easily falls into low-value regions and collects meaningless samples, increasing the number of samples needed to complete the task. Built on a hierarchical RL framework, the method discovers skills that differ from one another, trains a generative model on the collected samples, and maximizes state coverage during skill execution through an intrinsic reward, so that the agent is encouraged to explore unseen regions. Experiments show that the proposed method explores the environment more thoroughly, increases the exploration ability of the learned policy, keeps the agent from getting stuck in local regions, and improves sample efficiency on challenging tasks.

(4) We propose a twice-sampling method for deep Q-learning to address the sample inefficiency caused by the difficulty of sampling a useful training set from a large experience buffer. Experience replay is an important tool for stabilizing RL training, but the experience buffer is often very large, and different samples contribute differently to training. The proposed method analyzes the distribution of cumulative returns over episodes and additionally considers the training error of each sample. From these, an episode priority and a sample priority are obtained, and training data are then sampled according to both priorities. Experimental results demonstrate that the proposed method selects samples that accelerate network convergence, increases the sample efficiency of the method, and improves the performance of the learned policy.

Finally, the proposed methods are applied to a curling simulation system. First, we construct a curling competition simulated
system, then design the policy training method. The curling system is full of uncertainty and randomness, and the policy search space is huge; these factors seriously degrade the sample efficiency of the algorithm and make it challenging to learn high-quality curling policies quickly. We therefore combine the "distributional reinforcement learning method based on adaptive bounds" and the "twice-sampling method for deep Q-learning" to train the policy in the curling environment. Experimental results show that the learned policy overcomes these challenges, achieves better performance, and improves the sample efficiency of the algorithm on the curling task.
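To make the ideas behind the four contributions concrete, the abstract can be accompanied by small illustrative sketches. The first concerns contribution (1): the thesis does not spell out the algorithm here, so the following is only a minimal sketch, under our own assumptions, of how bootstrapped confidence intervals for the bounds of a value distribution might be computed and used to clip distributional targets; the function names and the clipping step are hypothetical.

```python
import numpy as np

def bootstrap_value_bounds(returns, n_boot=1000, alpha=0.1, seed=0):
    """Estimate confidence intervals for the lower and upper bounds of a
    value distribution by bootstrapping a set of observed returns.

    returns : 1-D array of sampled returns for a state-action pair.
    Returns (lo_ci, hi_ci): CIs for the distribution's min and max.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    mins, maxs = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement and record the extremes of each resample.
        resample = rng.choice(returns, size=returns.size, replace=True)
        mins[b], maxs[b] = resample.min(), resample.max()
    lo_ci = (np.quantile(mins, alpha / 2), np.quantile(mins, 1 - alpha / 2))
    hi_ci = (np.quantile(maxs, alpha / 2), np.quantile(maxs, 1 - alpha / 2))
    return lo_ci, hi_ci

def clip_targets(targets, lo_ci, hi_ci):
    """Clip distributional TD targets to the bootstrapped bounds."""
    return np.clip(targets, lo_ci[0], hi_ci[1])
```

The adaptive element is that the clipping bounds are re-estimated from data rather than fixed a priori, which is one plausible reading of "updating the bounds of the value function distribution".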
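For contribution (2), the core mechanism is rolling out a learned environment model with an exploratory policy to produce imagined trajectories. The sketch below assumes a generic `model(state, action) -> (next_state, reward)` interface and a `rollout_policy(state, rng) -> action` callable; both interfaces are our illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

def imagine_trajectories(model, rollout_policy, start_state,
                         horizon=5, n_traj=4, seed=0):
    """Generate short imagined trajectories from a learned environment model.

    model          : learned dynamics, (state, action) -> (next_state, reward)
    rollout_policy : exploratory policy, (state, rng) -> action
    Returns a list of trajectories, each a list of (state, action, reward).
    These trajectories would be fed to the policy network as extra input.
    """
    rng = np.random.default_rng(seed)
    trajectories = []
    for _ in range(n_traj):
        state, traj = start_state, []
        for _ in range(horizon):
            action = rollout_policy(state, rng)
            next_state, reward = model(state, action)  # model step, no real interaction
            traj.append((state, action, reward))
            state = next_state
        trajectories.append(traj)
    return trajectories
```

Because every step here queries the model instead of the real environment, the real samples are only used once, to fit the model, and are thereby reused many times in imagination.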
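For contribution (3), the thesis trains a generative model to reward state coverage; as a stand-in for that model, a simple count-based bonus over discretized states conveys the same intuition. This is a deliberately simplified sketch, not the thesis's generative-model-based intrinsic reward.

```python
from collections import Counter
import math

class CoverageBonus:
    """Count-based stand-in for a learned density model: rarely visited
    (discretized) states receive a larger intrinsic reward, pushing the
    skills toward maximal state coverage."""

    def __init__(self):
        self.counts = Counter()

    def intrinsic_reward(self, state):
        # First visit yields reward 1.0; repeated visits decay as 1/sqrt(n).
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])
```

Adding this bonus to the environment reward makes revisiting a low-value region progressively less attractive, which is the mechanism that keeps the agent out of the "deceptive reward" traps mentioned above.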
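For contribution (4), the two-level sampling can be sketched as: first draw an episode with probability proportional to a return-based priority, then draw a transition within that episode by its TD-error priority. The data layout and the priority formulas below are illustrative assumptions.

```python
import numpy as np

def twice_sample(episodes, episode_returns, td_errors, batch_size, seed=0):
    """Two-level prioritized sampling (illustrative sketch).

    episodes        : list of episodes, each a list of transition indices
    episode_returns : cumulative return of each episode
    td_errors       : dict mapping transition index -> |TD error|
    """
    rng = np.random.default_rng(seed)
    ep_prio = np.asarray(episode_returns, dtype=float)
    ep_prio = ep_prio - ep_prio.min() + 1e-3      # shift to strictly positive priorities
    ep_probs = ep_prio / ep_prio.sum()
    batch = []
    for _ in range(batch_size):
        e = rng.choice(len(episodes), p=ep_probs)          # episode-level draw
        idxs = episodes[e]
        prio = np.array([td_errors[i] for i in idxs]) + 1e-3
        t = rng.choice(idxs, p=prio / prio.sum())          # transition-level draw
        batch.append(int(t))
    return batch
```

Sampling twice in this way biases the training set toward high-return episodes and, within them, toward transitions the network currently predicts poorly.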
Keywords/Search Tags: Reinforcement learning, Deep Q-learning, Estimation of value function distribution, Model-based reinforcement learning, Hierarchical reinforcement learning, Curling strategy learning