
Research On Efficient Exploration Driven By Reward Function

Posted on: 2021-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: C Li
Full Text: PDF
GTID: 2428330620463267
Subject: Computer software and theory
Abstract/Summary:
The balance between exploration and exploitation has always been a central topic in reinforcement learning research. Exploration helps the agent understand the environment more comprehensively and make better decisions, while exploitation lets the agent make the currently optimal decisions based on its present knowledge of the environment. Reinforcement learning generates training data by interacting with the environment in order to evaluate and update the learned policies, rather than being guided by correct policies, so it needs efficient exploration during the learning process.

Reinforcement learning acquires rewards by interacting with the environment and learns optimal policies by maximizing cumulative reward, so the reward function of the environment directly influences the learning results. When rewards are dense, traditional exploration methods such as Optimistic Initial Values, Upper-Confidence-Bound Action Selection and Thompson Sampling can greatly improve the exploration efficiency of reinforcement learning algorithms, with solid theoretical guarantees, by measuring the uncertainty of the value functions during learning. However, these methods are only applicable to discrete state spaces; when the state space is large or even continuous, they may no longer apply.

When rewards are sparse, reinforcement learning algorithms can hardly obtain positive rewards while interacting with the environment, which may result in poor performance. Hindsight Experience Replay (HER) replays, at the end of each episode, transitions relabelled with new goal states in addition to the original goal states, so as to increase the proportion of training data with positive rewards; it then learns goal-conditioned policies by exploiting the similarity between the replayed goal states and the original goal states. However, when the original goal states are difficult to reach, the correlation between the goal states replayed by HER and the original goal states becomes weak, which may undermine the learning algorithm.

Aiming at these problems in environments with dense or sparse rewards, this thesis studies how to improve the exploration efficiency of algorithms through the reward function. The main contents are as follows:

(1) An exploration algorithm named RMAX-KNN, based on adaptive discretization of the state space, is proposed for environments with low-dimensional continuous state spaces and dense rewards. The algorithm adaptively discretizes the low-dimensional continuous state space and updates the Q-value functions of the resulting discrete points. It measures the uncertainty of the current state-action pair by counting how many of the pair's K nearest discrete points lie beyond a given distance threshold, and combines this with KNN regression to apply RMAX to environments with low-dimensional continuous state spaces. The degree of adaptive discretization determines the uncertainty of real state-action pairs, and the optimistic Q values assigned to uncertain pairs encourage the agent to explore the environment further. This work improves the exploration efficiency of the Q-learning and Sarsa algorithms, and proves theoretically that RMAX-KNN is a Probably Approximately Correct (PAC) optimal exploration algorithm.
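To make the uncertainty test in contribution (1) concrete, the Python sketch below shows one way an RMAX-style optimistic value could be combined with KNN regression over adaptively stored discrete points. The function names, the distance weighting and the unknown-fraction rule are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

# Minimal sketch, not the thesis's exact algorithm: the names and thresholds below
# (knn_uncertainty, q_estimate, unknown_fraction) are illustrative assumptions.

def knn_uncertainty(query, points, k, dist_threshold):
    """Count how many of the k nearest stored discrete points lie farther than
    dist_threshold from the queried state-action feature vector."""
    dists = np.linalg.norm(points - query, axis=1)
    nearest = np.sort(dists)[:k]
    return int(np.sum(nearest > dist_threshold))

def q_estimate(query, points, q_values, k, dist_threshold,
               r_max=1.0, gamma=0.99, unknown_fraction=0.5):
    """Optimistic RMAX value for 'unknown' pairs; distance-weighted KNN
    regression over stored Q values otherwise."""
    if knn_uncertainty(query, points, k, dist_threshold) / k > unknown_fraction:
        # Too many of the nearest neighbours are still far away: treat the pair
        # as unknown and return the optimistic value to drive further exploration.
        return r_max / (1.0 - gamma)
    dists = np.linalg.norm(points - query, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] + 1e-8)          # closer points weigh more
    return float(np.average(q_values[idx], weights=weights))
```

Here `points` would hold the adaptively created discrete state-action points and `q_values` their current Q estimates, both maintained by the learning algorithm.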
(2) A multi-stage hindsight experience replay algorithm is proposed to improve the performance of reinforcement learning algorithms when rewards are sparse. On the one hand, the algorithm decomposes the original task into several stages of increasing difficulty; on the other hand, the agent successively learns, with HER, goal-conditioned policies that reach the goal-state area specified in each stage (a sketch of this staged scheme is given below). This yields an explicit form of curriculum learning and strengthens the relevance between the replayed goal states and the goal states of each stage. When goal states are difficult to reach in multi-goal environments with sparse rewards, this algorithm helps reinforcement learning algorithms learn goal-conditioned policies.

In summary, aiming at problems in environments with dense or sparse rewards, this thesis studies how to improve the exploration efficiency of reinforcement learning algorithms. The results may be meaningful for extending the traditional RMAX exploration algorithm to low-dimensional continuous state spaces and for learning optimal policies in multi-goal environments with sparse rewards; they may also be valuable for applying reinforcement learning to real-life problems.
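As a companion to contribution (2), the following Python sketch illustrates one way the staged goal relabelling could be organized: each stage supplies goals from a progressively harder area, and standard HER "future" relabelling is applied within the stage. The agent interface (`rollout`, `store`, `update`), `reward_fn`, and the stage representation are assumed for illustration and are not taken from the thesis.

```python
import random

# Illustrative sketch only: the stage decomposition, the agent interface
# (rollout / store / update) and reward_fn are assumed names, not the thesis's API.

def her_relabel(episode, reward_fn, k_future=4):
    """HER 'future' relabelling: store each transition again with goals taken from
    states achieved later in the same episode, recomputing the sparse reward."""
    relabelled = []
    for t, (s, a, s_next, goal) in enumerate(episode):
        relabelled.append((s, a, reward_fn(s_next, goal), s_next, goal))
        for _ in range(k_future):
            new_goal = random.choice(episode[t:])[2]   # an achieved state, reused as goal
            relabelled.append((s, a, reward_fn(s_next, new_goal), s_next, new_goal))
    return relabelled

def train_multi_stage(agent, env, stages, reward_fn, episodes_per_stage=1000):
    """Curriculum over goal-state areas of increasing difficulty: each stage supplies
    its own goal sampler, and HER is applied to every episode within that stage."""
    for sample_goal in stages:                         # ordered from easiest to hardest
        for _ in range(episodes_per_stage):
            goal = sample_goal()                       # a goal from this stage's area
            episode = agent.rollout(env, goal)         # list of (s, a, s_next, goal)
            agent.store(her_relabel(episode, reward_fn))
            agent.update()
```

Training on the easiest goal area first keeps the relabelled goals close to the goals the agent is actually asked to reach, which is the relevance property the multi-stage scheme is designed to preserve.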
Keywords/Search Tags:the balance between exploration and exploitation, reinforcement learning, PAC optimal exploration, sparse reward, multi-stage, multi-goal environments with sparse rewards