
Research On Uncertainty-weighted Offline Reinforcement Learning

Posted on: 2024-01-26  Degree: Master  Type: Thesis
Country: China  Candidate: B H Xie  Full Text: PDF
GTID: 2568306932956039  Subject: Information and Communication Engineering
Abstract/Summary:
With the development of machine learning, artificial intelligence has been applied to a wide range of scenarios, and significant progress has been made in reinforcement learning for video, board, and card games. However, applying reinforcement learning efficiently in real-world settings where exploration costs are high, such as healthcare, autonomous vehicles, and robotics, remains a challenge. Offline reinforcement learning avoids exploration costs by learning policies from previously collected datasets. However, the lack of continual interaction with the environment introduces a distribution shift between the learned policy and the behavior policy. Training then compounds the extrapolation errors caused by out-of-distribution (OOD) actions or states, which can lead to training failure.

To tackle this issue, most existing approaches fall into two categories, depending on whether OOD actions are used during policy evaluation: reinforcement learning-based (RL-based) methods and imitation learning-based (IL-based) methods. The former restrict the distance between the target policy and the behavior policy through value-function regularization or policy constraints; these approaches require a trade-off between the accuracy of value estimation and policy improvement, and may become over-conservative. The latter apply imitation learning on the dataset rather than querying the values of unseen actions, thereby avoiding extrapolation error; but avoiding OOD samples also limits performance improvement and forfeits the possibility of generalizing beyond the dataset.

To address these problems, this thesis proposes two offline reinforcement learning methods from the perspective of uncertainty in deep learning. To address the over-conservative value estimation of RL-based offline reinforcement learning methods, this thesis proposes Double Actors and Uncertainty-Weighted Critics for Offline Reinforcement Learning (DAUWC). During policy evaluation, a moderately optimistic state-value function is learned with double actors, while uncertainty estimation is introduced through an ensemble of state-action value networks. During policy extraction, the learned policy is implicitly constrained toward the behavior policy by weighting the advantage value, which stabilizes training.

To address the limited performance improvement of IL-based offline reinforcement learning methods, this thesis proposes Uncertainty-Weighted Implicit Q-Learning (UWIQL). During policy evaluation, the offline data is fully exploited while OOD samples are avoided; uncertainty estimation and expectile regression are used to estimate the state-value function, effectively improving its generalization performance. During policy extraction, advantage-weighted behavioral cloning and uncertainty optimization are used to maximize the advantage value function, thereby improving the exploitation ability.

Experimental results on D4RL, a standard benchmark for offline RL, show that the proposed methods outperform state-of-the-art methods, achieving higher normalized scores on most tasks and paving the way for practical industrial applications of reinforcement learning. In future work, we will continue our research on offline reinforcement learning and apply our methods to more industrial scenarios.
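The ensemble-based uncertainty weighting that DAUWC relies on can be sketched as follows. The thesis abstract does not give the implementation, so everything below is an illustrative assumption: the ensemble is stood in for by K linear Q-functions, and the function names (`q_ensemble`, `uncertainty_weight`) and the `exp(-beta * std)` weighting rule are hypothetical choices, not the thesis's own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for K trained Q-networks: K linear Q-functions with
# slightly different weights, so their estimates disagree on some (s, a) pairs.
K, state_dim, action_dim = 5, 4, 2
weights = [rng.normal(size=state_dim + action_dim) * (1 + 0.1 * k) for k in range(K)]

def q_ensemble(state, action):
    """Return the K ensemble estimates of Q(s, a)."""
    sa = np.concatenate([state, action])
    return np.array([w @ sa for w in weights])

def uncertainty_weight(state, action, beta=1.0):
    """Down-weight (s, a) pairs whose ensemble Q-estimates disagree.

    A high standard deviation across the ensemble signals a likely
    out-of-distribution pair, so its advantage contributes less to the
    policy update. The exponential form is an illustrative choice.
    """
    std = q_ensemble(state, action).std()
    return np.exp(-beta * std)

# Usage: weight lies in (0, 1]; more disagreement -> smaller weight.
s, a = rng.normal(size=state_dim), rng.normal(size=action_dim)
w = uncertainty_weight(s, a)
```

In a full implementation this weight would multiply the advantage term in the policy-extraction objective, so confidently estimated in-distribution actions dominate the update.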
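The two ingredients UWIQL builds on, expectile regression for the state-value function and advantage-weighted behavioral cloning for policy extraction, can be sketched in the standard Implicit Q-Learning form. The parameter values (`tau`, `temperature`, the clip at 100) are common illustrative defaults, not values stated in the thesis.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss for value learning, diff = Q(s, a) - V(s).

    For tau > 0.5 the loss penalizes under-estimation (diff > 0) more than
    over-estimation, so V(s) approaches an upper expectile of Q(s, a) while
    using only actions that appear in the dataset (no OOD queries).
    """
    weight = np.where(diff > 0, tau, 1 - tau)
    return weight * diff ** 2

def awbc_weight(advantage, temperature=3.0, clip=100.0):
    """Advantage-weighted behavioral-cloning weight: exp(T * A), clipped.

    Log-likelihood of dataset actions is scaled by this weight, so the
    policy imitates the dataset but favors high-advantage actions.
    """
    return np.minimum(np.exp(temperature * advantage), clip)
```

For example, with `tau = 0.7` an under-estimate (`diff = 1.0`) is penalized with weight 0.7 while an over-estimate (`diff = -1.0`) gets weight 0.3, which is what pushes V(s) above the mean of in-dataset Q-values.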
Keywords/Search Tags:Machine Learning, Reinforcement Learning, Offline Reinforcement Learning, Imitation Learning