
Using Task Priors in Reinforcement Learning Exploration

Posted on: 2020-09-02    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S D Yang    Full Text: PDF
GTID: 1368330647450607    Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. In recent years it has been widely applied to human-versus-machine game play, automatic control, and personalized recommendation systems. Efficient exploration is essential in reinforcement learning. As reinforcement learning tasks become more difficult (for example, problems with sparse rewards and complex structure), prior knowledge about the task is increasingly used to accelerate exploration. Such prior knowledge falls mainly into two categories: the first is prior knowledge about the optimal policy, such as expert demonstration data collected under an optimal policy; the second is prior knowledge about task structure, such as the hierarchical structure present in some tasks. However, in both theory and practice, existing studies have the following shortcomings: (i) priors of the first kind require an expert's optimal policy to collect demonstration data, which is costly in many tasks; (ii) priors about task structure differ greatly across tasks, so existing methods generalize poorly; (iii) theoretical analysis of infinite-horizon reinforcement learning problems with prior knowledge is lacking. To address these shortcomings, this dissertation makes the following contributions:

1. For shortcomings (i) and (iii), a stochastic multi-armed bandit method based on the Near-optimal Mean Reward (NoMR) prior was proposed. First, the background of the NoMR prior was introduced and the assumption that the NoMR lies between the optimal mean reward and the sub-optimal mean reward was stated; the stochastic multi-armed bandit problem with a NoMR prior was then formally defined. A uniform lower bound on the cumulative regret of this problem was derived and used to characterize its difficulty. The NoMR-bandit algorithm was proposed to solve the problem: the NoMR is used to design a bimodal function that improves the efficiency of exploration (a simplified sketch of using such a prior is given after this abstract), and an upper bound on the algorithm's regret was given. For the case where the NoMR is smaller than the sub-optimal mean reward, an algorithm called Cascade-bandit with a logarithmic regret upper bound was proposed. Finally, experimental results showed that NoMR-bandit outperformed the compared algorithms.

2. For shortcomings (i) and (iii), an average-reward reinforcement learning algorithm based on a Constant Shifting Value (CSV) was proposed. First, the background of the CSV prior was introduced, the assumption that the CSV lies between the optimal average reward and the sub-optimal average reward was stated, and the average-reward reinforcement learning problem with a CSV prior was formally defined. To solve this problem, CSV-learning was proposed, in which the CSV is used as a reward-shaping term to improve exploration efficiency (a simplified sketch is given after this abstract). A termination condition for CSV-learning, triggered when certain conditions are met, was also introduced. For large-scale problems, a CSV-learning variant based on linear value function approximation was proposed. Finally, experimental results showed that both the tabular and the linear function approximation versions of CSV-learning performed better than the compared algorithms.
3. For shortcomings (ii) and (iii), a Hard-Transiting reinforcement learning algorithm based on a Transition Exploratory Bonus (TEB) was proposed. First, a new task structure called Hard-Transiting was introduced and the Hard-Transiting reinforcement learning problem was defined. The MBIE-TEB algorithm was proposed to solve this problem: exploiting the Hard-Transiting property, the TEB is added as a bonus in the Q-value update to encourage the agent to explore state-action pairs with small transition probability, and the Q-values are updated iteratively from the estimated environment model during learning (a simplified sketch of bonus-augmented, model-based Q iteration is given below). The work proved that MBIE-TEB can learn an approximately optimal policy within a polynomial number of time steps. For large-scale problems, an improved DQN algorithm based on the TEB, DQN-TEB, was proposed: a TEB module is added to the standard DQN architecture to encourage exploration of state-action pairs with small transition probability in large-scale problems. Finally, experimental results demonstrated that MBIE-TEB and DQN-TEB outperformed the compared algorithms.
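The following minimal Python sketch illustrates how a known near-optimal mean reward threshold (as in contribution 1) can be used to prune exploration in a stochastic multi-armed bandit. It is an illustration under stated assumptions, not the dissertation's NoMR-bandit: the bimodal exploration function and the regret analysis are not reproduced, and a plain UCB1 confidence radius is used instead.

import math
import random

class NoMRBandit:
    """Illustrative bandit that uses a known near-optimal mean reward (NoMR)
    as a prior threshold. This is not the dissertation's NoMR-bandit; it only
    sketches how such a prior can prune clearly sub-optimal arms."""

    def __init__(self, n_arms, nomr):
        self.n_arms = n_arms
        self.nomr = nomr              # assumed: optimal mean > nomr > sub-optimal mean
        self.counts = [0] * n_arms    # number of pulls per arm
        self.means = [0.0] * n_arms   # empirical mean reward per arm

    def select_arm(self, t):
        # Play every arm once before relying on confidence bounds.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        # UCB1-style confidence radius.
        ucb = [self.means[a] + math.sqrt(2 * math.log(t) / self.counts[a])
               for a in range(self.n_arms)]
        # Arms whose UCB falls below the NoMR threshold cannot be optimal
        # under the prior assumption, so they are excluded from play.
        candidates = [a for a in range(self.n_arms) if ucb[a] >= self.nomr]
        if not candidates:
            candidates = list(range(self.n_arms))   # fallback: plain UCB
        return max(candidates, key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Usage on a toy two-armed problem with Bernoulli rewards.
if __name__ == "__main__":
    true_means = [0.9, 0.5]          # optimal vs. sub-optimal arm
    bandit = NoMRBandit(n_arms=2, nomr=0.7)
    for t in range(1, 5001):
        arm = bandit.select_arm(t)
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        bandit.update(arm, reward)
    print(bandit.counts, bandit.means)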
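Similarly, a minimal sketch of the core idea in contribution 2: using a constant shifting value as a reward-shaping term in tabular average-reward Q-learning. The update below is a generic shifted Q-learning step under a hypothetical environment interface (reset/step/actions); the dissertation's exact CSV-learning update, its termination condition, and the linear-function-approximation variant are not reproduced.

from collections import defaultdict
import random

def csv_learning(env, csv, episodes=200, alpha=0.1, epsilon=0.1):
    """Sketch of tabular average-reward Q-learning with a Constant Shifting
    Value (CSV). The CSV is subtracted from every reward, standing in for
    the unknown optimal average reward; this is an illustration only.

    Assumed (hypothetical) env interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), env.actions -> list.
    """
    Q = defaultdict(float)

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Shift the reward by the CSV so relative (bias) values are learned.
            target = (r - csv) + max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q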
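Finally, a sketch of model-based Q iteration with an exploration bonus added to the Q-value update, in the spirit of contribution 3's MBIE-TEB. The bonus used here (beta divided by the square root of the visit count) is a common stand-in and an assumption on my part; the actual Transition Exploratory Bonus, which specifically targets state-action pairs with small transition probability, is defined in the dissertation and not reproduced here.

import numpy as np

def model_based_q_with_bonus(counts, rewards, gamma=0.95, beta=1.0, iters=200):
    """Sketch of MBIE-style Q iteration on an estimated model with an
    exploration bonus added to the Q-value update. Not the TEB itself.

    counts[s, a, s']  : observed transition counts
    rewards[s, a]     : summed observed rewards for (s, a)
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                          # visits to each (s, a)
    p_hat = counts / np.maximum(n_sa[..., None], 1)    # empirical transition model
    r_hat = rewards / np.maximum(n_sa, 1)              # empirical mean reward
    bonus = beta / np.sqrt(np.maximum(n_sa, 1))        # larger for rarely tried pairs

    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                              # greedy state values
        Q = r_hat + bonus + gamma * (p_hat @ V)        # optimistic Bellman backup
    return Q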
Keywords/Search Tags:Stochastic Multi-armed Bandits, Reinforcement Learning, Exploration and Exploitation, Task Prior Knowledge