
Research On Reinforcement Learning Methods Towards Unfixed Tasks And Non-static Environments

Posted on: 2019-11-12    Degree: Master    Type: Thesis
Country: China    Candidate: S Y Chen    Full Text: PDF
GTID: 2428330545985297    Subject: Computer technology
Abstract/Summary:
Reinforcement learning is one of the most important research topics in machine learning. It aims at learning, from trial and error, a policy that maximizes the cumulative reward obtained by interacting with the environment autonomously, and it has achieved significant progress in solving optimal sequential decision problems. However, traditional reinforcement learning approaches are designed to work on fixed tasks in static environments. In many real-world problems, an agent needs to accomplish not a single fixed task but a whole range of tasks, and real-world environments are commonly dynamic. As a result, both the practicality and the learning performance of reinforcement learning methods degrade in the real world.

To enable reinforcement learning to cope with unfixed tasks, an agent can learn a meta-policy over a set of training tasks drawn from an underlying distribution. By maximizing the total reward summed over all the training tasks, the meta-policy can then be reused to accomplish test tasks from the same distribution. In practice, however, we face two major obstacles to training and reusing meta-policies well. First, how to identify tasks that are unrelated or even opposed to each other, so as to avoid their mutual interference during training. Second, how to characterize task features, according to which a meta-policy can be reused. In this work, we propose the MAPLE approach, which overcomes the two difficulties by introducing the shallow trail. Empirical studies on several control tasks verify that MAPLE trains meta-policies well and receives high reward on test tasks.

A direct cause of performance degradation in dynamic environments is the high-variance and biased estimation of the reward induced by distribution shift. We propose two techniques to alleviate this unstable reward estimation problem: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we obtain the Robust DQN method. We apply Robust DQN to the tip recommendation system of the Taobao online retail platform. We first disclose the highly dynamic nature of this recommendation application, and then carry out an online A/B test to examine Robust DQN. The results show that Robust DQN effectively stabilizes value estimation and therefore improves performance in this real-world dynamic environment.
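To make the two stabilization techniques concrete, below is a minimal illustrative sketch in Python. It is not the thesis's implementation: the class name StratifiedReplayBuffer, the choice of stratum key (e.g., hour of day), and the per-stratum baseline are assumptions added for illustration. The sketch only shows the two ideas named above: partitioning the replay memory into strata and drawing each minibatch evenly across them, and replacing the raw reward with its gap to a per-stratum baseline.

    import random
    from collections import defaultdict, deque

    class StratifiedReplayBuffer:
        """Illustrative replay buffer that keeps one sub-buffer per stratum
        (e.g. per hour of day) and draws an equal share of each minibatch
        from every stratum, so transitions from under-represented periods
        are not crowded out by the most recent data distribution."""

        def __init__(self, capacity_per_stratum=10000):
            self.buffers = defaultdict(lambda: deque(maxlen=capacity_per_stratum))

        def add(self, stratum_key, transition):
            # transition: a (state, action, reward, next_state, done) tuple
            self.buffers[stratum_key].append(transition)

        def sample(self, batch_size):
            strata = [buf for buf in self.buffers.values() if len(buf) > 0]
            if not strata:
                return []
            per_stratum = max(1, batch_size // len(strata))
            batch = []
            for buf in strata:
                batch.extend(random.sample(list(buf), min(per_stratum, len(buf))))
            random.shuffle(batch)
            return batch[:batch_size]

    def approximate_regretted_reward(reward, baseline_reward):
        """Hypothetical reward adjustment: learn from the gap between the
        observed reward and a per-stratum baseline (e.g. the running mean
        reward of a reference strategy), so that environment-wide drift is
        subtracted out of the learning signal."""
        return reward - baseline_reward

A Double DQN learner would then draw its minibatches from such a buffer and use the adjusted reward in its target computation, in the spirit of the Robust DQN method described above.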
Keywords/Search Tags: reinforcement learning, meta-policy, shallow trail, stratified sampling replay, approximate regretted reward