
Research On Reinforcement Learning Methods Towards Unfixed Tasks And Non-static Environments

Posted on: 2019-11-12    Degree: Master    Type: Thesis
Country: China    Candidate: S Y Chen    Full Text: PDF
GTID: 2428330545985297    Subject: Computer technology
Abstract/Summary:
Reinforcement learning is one of the most important research topics in machine learning. It aims at learning, from trial and error, a policy that maximizes the cumulative reward obtained by interacting with the environment autonomously, and it has achieved significant progress in solving optimal sequential decision problems. However, traditional reinforcement learning approaches are designed to work on fixed tasks in static environments. In many real-world problems, an agent needs to accomplish not a single fixed task but a whole range of tasks, and real-world environments are commonly dynamic. As a result, both the practicality and the learning performance of reinforcement learning methods degrade in the real world.

To enable reinforcement learning to cope with unfixed tasks, an agent can learn a meta-policy over a set of training tasks drawn from an underlying distribution. By maximizing the total reward summed over all the training tasks, the meta-policy can then be reused to accomplish test tasks from the same distribution. In practice, however, we face two major obstacles to training and reusing meta-policies well. First, how to identify tasks that are unrelated or even opposed to each other, so as to avoid their mutual interference during training. Second, how to characterize task features, according to which a meta-policy can be reused. In this work, we propose the MAPLE approach, which overcomes the two difficulties by introducing the shallow trail. Empirical studies on several control tasks verify that MAPLE trains meta-policies well and receives high reward on test tasks.

A direct cause of performance degradation in dynamic environments is the high-variance and biased estimation of the reward induced by distribution shift. We propose two techniques to alleviate this unstable reward estimation problem: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we obtain the Robust DQN method. We apply Robust DQN to the tip recommendation system of the Taobao online retail platform. We first disclose the highly dynamic nature of this recommendation application, and then carry out an online A/B test to examine Robust DQN. The results show that Robust DQN effectively stabilizes value estimation and therefore improves performance in this real-world dynamic environment.
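To make the two stabilization techniques concrete, below is a minimal illustrative sketch in Python. It is not the thesis's implementation: the class name StratifiedReplayBuffer, the choice of stratum key (e.g., hour of day), and the per-stratum baseline are assumptions added for illustration. The sketch only shows the two ideas named above: partitioning the replay memory into strata and drawing each minibatch evenly across them, and replacing the raw reward with its gap to a per-stratum baseline.

    import random
    from collections import defaultdict, deque

    class StratifiedReplayBuffer:
        """Illustrative replay buffer that keeps one sub-buffer per stratum
        (e.g. per hour of day) and draws an equal share of each minibatch
        from every stratum, so transitions from under-represented periods
        are not crowded out by the most recent data distribution."""

        def __init__(self, capacity_per_stratum=10000):
            self.buffers = defaultdict(lambda: deque(maxlen=capacity_per_stratum))

        def add(self, stratum_key, transition):
            # transition: a (state, action, reward, next_state, done) tuple
            self.buffers[stratum_key].append(transition)

        def sample(self, batch_size):
            strata = [buf for buf in self.buffers.values() if len(buf) > 0]
            if not strata:
                return []
            per_stratum = max(1, batch_size // len(strata))
            batch = []
            for buf in strata:
                batch.extend(random.sample(list(buf), min(per_stratum, len(buf))))
            random.shuffle(batch)
            return batch[:batch_size]

    def approximate_regretted_reward(reward, baseline_reward):
        """Hypothetical reward adjustment: learn from the gap between the
        observed reward and a per-stratum baseline (e.g. the running mean
        reward of a reference strategy), so that environment-wide drift is
        subtracted out of the learning signal."""
        return reward - baseline_reward

A Double DQN learner would then draw its minibatches from such a buffer and use the adjusted reward in its target computation, in the spirit of the Robust DQN method described above.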
Keywords/Search Tags: reinforcement learning, meta-policy, shallow trail, stratified sampling replay, approximate regretted reward