
Using Task Priors in Reinforcement Learning Exploration

Posted on: 2020-09-02    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S D Yang    Full Text: PDF
GTID: 1368330647450607    Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is an important branch of machine learning. In recent years it has been widely applied to human-versus-machine game play, automatic control, and personalized recommendation systems. Efficient exploration is essential in reinforcement learning. As reinforcement learning tasks become more difficult (for example, problems with sparse rewards and complex structure), prior knowledge about the task is increasingly used to accelerate exploration. Such prior knowledge falls mainly into two categories: the first is prior knowledge about the optimal policy, such as expert demonstration data collected under an optimal policy; the second is prior knowledge about task structure, such as the hierarchical structure present in some tasks. However, in both theory and practice, existing studies have the following shortcomings: (i) priors of the first kind require an expert's optimal policy to collect demonstration data, which is costly in many tasks; (ii) priors about task structure differ greatly across tasks, so existing methods generalize poorly; (iii) theoretical analysis of infinite-horizon reinforcement learning problems with prior knowledge is lacking. To address these shortcomings, this dissertation makes the following contributions:

1. For shortcomings (i) and (iii), a stochastic multi-armed bandit method based on the Near-optimal Mean Reward (NoMR) prior was proposed. First, the background of the NoMR prior was introduced and the assumption that the NoMR lies between the optimal mean reward and the sub-optimal mean reward was stated; the stochastic multi-armed bandit problem with a NoMR prior was then formally defined. A uniform lower bound on the cumulative regret of this problem was derived and used to characterize its difficulty. The NoMR-bandit algorithm was proposed to solve the problem: the NoMR is used to design a bimodal function that improves the efficiency of exploration (a simplified sketch of using such a prior is given after this abstract), and an upper bound on the algorithm's regret was given. For the case where the NoMR is smaller than the sub-optimal mean reward, an algorithm called Cascade-bandit with a logarithmic regret upper bound was proposed. Finally, experimental results showed that NoMR-bandit outperformed the compared algorithms.

2. For shortcomings (i) and (iii), an average-reward reinforcement learning algorithm based on a Constant Shifting Value (CSV) was proposed. First, the background of the CSV prior was introduced, the assumption that the CSV lies between the optimal average reward and the sub-optimal average reward was stated, and the average-reward reinforcement learning problem with a CSV prior was formally defined. To solve this problem, CSV-learning was proposed, in which the CSV is used as a reward-shaping term to improve exploration efficiency (a simplified sketch is given after this abstract). A termination condition for CSV-learning, triggered when certain conditions are met, was also introduced. For large-scale problems, a CSV-learning variant based on linear value function approximation was proposed. Finally, experimental results showed that both the tabular and the linear function approximation versions of CSV-learning performed better than the compared algorithms.
3. For shortcomings (ii) and (iii), a Hard-Transiting reinforcement learning algorithm based on a Transition Exploratory Bonus (TEB) was proposed. First, a new task structure called Hard-Transiting was introduced and the Hard-Transiting reinforcement learning problem was defined. The MBIE-TEB algorithm was proposed to solve this problem: exploiting the Hard-Transiting property, the TEB is added as a bonus in the Q-value update to encourage the agent to explore state-action pairs with small transition probability, and the Q-values are updated iteratively from the estimated environment model during learning (a simplified sketch of bonus-augmented, model-based Q iteration is given below). The work proved that MBIE-TEB can learn an approximately optimal policy within a polynomial number of time steps. For large-scale problems, an improved DQN algorithm based on the TEB, DQN-TEB, was proposed: a TEB module is added to the standard DQN architecture to encourage exploration of state-action pairs with small transition probability in large-scale problems. Finally, experimental results demonstrated that MBIE-TEB and DQN-TEB outperformed the compared algorithms.
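The following minimal Python sketch illustrates how a known near-optimal mean reward threshold (as in contribution 1) can be used to prune exploration in a stochastic multi-armed bandit. It is an illustration under stated assumptions, not the dissertation's NoMR-bandit: the bimodal exploration function and the regret analysis are not reproduced, and a plain UCB1 confidence radius is used instead.

import math
import random

class NoMRBandit:
    """Illustrative bandit that uses a known near-optimal mean reward (NoMR)
    as a prior threshold. This is not the dissertation's NoMR-bandit; it only
    sketches how such a prior can prune clearly sub-optimal arms."""

    def __init__(self, n_arms, nomr):
        self.n_arms = n_arms
        self.nomr = nomr              # assumed: optimal mean > nomr > sub-optimal mean
        self.counts = [0] * n_arms    # number of pulls per arm
        self.means = [0.0] * n_arms   # empirical mean reward per arm

    def select_arm(self, t):
        # Play every arm once before relying on confidence bounds.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        # UCB1-style confidence radius.
        ucb = [self.means[a] + math.sqrt(2 * math.log(t) / self.counts[a])
               for a in range(self.n_arms)]
        # Arms whose UCB falls below the NoMR threshold cannot be optimal
        # under the prior assumption, so they are excluded from play.
        candidates = [a for a in range(self.n_arms) if ucb[a] >= self.nomr]
        if not candidates:
            candidates = list(range(self.n_arms))   # fallback: plain UCB
        return max(candidates, key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Usage on a toy two-armed problem with Bernoulli rewards.
if __name__ == "__main__":
    true_means = [0.9, 0.5]          # optimal vs. sub-optimal arm
    bandit = NoMRBandit(n_arms=2, nomr=0.7)
    for t in range(1, 5001):
        arm = bandit.select_arm(t)
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        bandit.update(arm, reward)
    print(bandit.counts, bandit.means)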
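Similarly, a minimal sketch of the core idea in contribution 2: using a constant shifting value as a reward-shaping term in tabular average-reward Q-learning. The update below is a generic shifted Q-learning step under a hypothetical environment interface (reset/step/actions); the dissertation's exact CSV-learning update, its termination condition, and the linear-function-approximation variant are not reproduced.

from collections import defaultdict
import random

def csv_learning(env, csv, episodes=200, alpha=0.1, epsilon=0.1):
    """Sketch of tabular average-reward Q-learning with a Constant Shifting
    Value (CSV). The CSV is subtracted from every reward, standing in for
    the unknown optimal average reward; this is an illustration only.

    Assumed (hypothetical) env interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), env.actions -> list.
    """
    Q = defaultdict(float)

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Shift the reward by the CSV so relative (bias) values are learned.
            target = (r - csv) + max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q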
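Finally, a sketch of model-based Q iteration with an exploration bonus added to the Q-value update, in the spirit of contribution 3's MBIE-TEB. The bonus used here (beta divided by the square root of the visit count) is a common stand-in and an assumption on my part; the actual Transition Exploratory Bonus, which specifically targets state-action pairs with small transition probability, is defined in the dissertation and not reproduced here.

import numpy as np

def model_based_q_with_bonus(counts, rewards, gamma=0.95, beta=1.0, iters=200):
    """Sketch of MBIE-style Q iteration on an estimated model with an
    exploration bonus added to the Q-value update. Not the TEB itself.

    counts[s, a, s']  : observed transition counts
    rewards[s, a]     : summed observed rewards for (s, a)
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                          # visits to each (s, a)
    p_hat = counts / np.maximum(n_sa[..., None], 1)    # empirical transition model
    r_hat = rewards / np.maximum(n_sa, 1)              # empirical mean reward
    bonus = beta / np.sqrt(np.maximum(n_sa, 1))        # larger for rarely tried pairs

    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                              # greedy state values
        Q = r_hat + bonus + gamma * (p_hat @ V)        # optimistic Bellman backup
    return Q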
Keywords/Search Tags:Stochastic Multi-armed Bandits, Reinforcement Learning, Exploration and Exploitation, Task Prior Knowledge