Reinforcement Learning (RL) is an important research direction in Machine Learning (ML) and is also considered a crucial way to achieve general Artificial Intelligence (AI). The key difference between RL and other ML domains is that RL involves an active learning process: the agent collects experience by interacting with the environment and learns a value function and policy by maximizing the cumulative reward. The central problem of RL is to improve the exploration efficiency of the agent. In finite state-action spaces, exploration algorithms use state visitation counts and confidence bounds to obtain Probably Approximately Correct (PAC) guarantees. However, these methods cannot be applied directly in deep RL, where the state space is high-dimensional. In high-dimensional and sparse-reward settings, the agent must explore a large state space with little reward guidance, and therefore requires strategic exploration methods that guide it toward unknown states and actions. Meanwhile, since the multimodality and stochasticity of the environment usually deteriorate exploration, the agent needs robust exploration algorithms. In addition, unlike the single-goal RL problem, a multi-goal exploration task couples the state space with the goal space, which enlarges the overall exploration space and requires dedicated multi-goal exploration methods to learn a multi-goal policy. Improving the exploration efficiency of RL is thus an important way to handle high-dimensional spaces, solve the sparse-reward problem, cope with multimodal and stochastic dynamics, and learn multi-goal policies.

This paper studies the exploration efficiency of RL from four perspectives: uncertainty measurement, multimodality, robustness, and multi-goal learning. Each perspective addresses a different challenge in exploration and improves the decision-making ability of RL algorithms. The research contents are as follows.

First, we propose an uncertainty measurement theory and a backward induction method for large-scale exploration tasks. We use epistemic uncertainty to measure the agent's lack of knowledge about the environment under high-dimensional observations: a bootstrapping network measures the epistemic uncertainty, and an episodic backward update improves the sample efficiency. The bootstrapping method uses a Bayesian posterior to estimate the posterior distribution of the value function, which provides a general uncertainty estimate in high-dimensional spaces; applying the optimism principle then encourages the agent to explore areas with high uncertainty. A theoretical analysis of the proposed uncertainty is also given. The episodic backward update has contraction properties in the value update; by propagating future uncertainty in a time-consistent manner through this backward update, the method exploits this theoretical advantage and empirically improves the sample efficiency. Our experiments on image-based tasks suggest that the proposed method gives reliable uncertainty measurements and improved sample efficiency.
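To make the optimism principle concrete, the sketch below shows one common way to turn a bootstrapped ensemble of value estimates into an optimistic action choice. It is a minimal illustration rather than the exact architecture of this work; the ensemble size, the exploration coefficient beta, and the function name are assumptions introduced for the example.

```python
import numpy as np

def optimistic_action(q_heads: np.ndarray, beta: float = 1.0) -> int:
    """Pick an action optimistically from a bootstrapped ensemble.

    q_heads: array of shape (num_heads, num_actions); one row of Q-value
             estimates per bootstrap head for the current observation.
    beta:    coefficient scaling the epistemic-uncertainty bonus.
    """
    mean_q = q_heads.mean(axis=0)        # ensemble mean value estimate
    epistemic_std = q_heads.std(axis=0)  # head disagreement ~ epistemic uncertainty
    upper_bound = mean_q + beta * epistemic_std
    return int(np.argmax(upper_bound))   # optimism in the face of uncertainty

# Example: 5 bootstrap heads, 3 actions
rng = np.random.default_rng(0)
q_values = rng.normal(size=(5, 3))
action = optimistic_action(q_values, beta=0.5)
```

Actions with high disagreement across heads receive a larger bonus, so the agent is drawn toward regions where its knowledge of the value function is still poor.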
Second, we propose a variational dynamics theory and method for learning multimodal dynamics. Our method treats the environmental state-action transition as a conditional generative process that produces the next-state prediction conditioned on the current state, the action, and a latent variable, and we derive a theoretical bound on the environmental transition as the learning objective. The latent variables encode the multimodality and stochasticity of the underlying dynamics through multiple samples, and the model is trained by optimizing the variational bound with stochastic gradients and the reparameterization trick. Compared with existing methods that treat multimodal exploration as expected single-modal exploration, the proposed model provides a better understanding of the dynamics and learns the multimodality in high-dimensional image-based tasks. We further propose an intrinsic reward for self-supervised exploration based on the learned latent space, with convergence guarantees. Experimental results show that the proposed variational inference method learns complex dynamics, improves exploration efficiency in multimodal tasks, and also performs well in real-world tasks.

Third, we propose a dynamics bottleneck theory and method that learns robust representations to handle stochasticity during exploration. The representation learning follows the Information Bottleneck (IB) principle: the representation should acquire dynamics-relevant information while discarding dynamics-irrelevant features. To handle mutual-information maximization in high-dimensional spaces, we theoretically derive a predictive objective and a contrastive objective to estimate the mutual information. Based on the robust representation, we further construct an exploration bonus based on the information gain, and we prove that this DB-bonus is closely related to the provably efficient UCB-bonus in linear MDPs and to the visitation count in tabular MDPs. The exploration directly utilizes the information gain of the transitions, which filters out dynamics-irrelevant noise. Experiments on image-based tasks with different injected noises demonstrate that the proposed method learns meaningful representations and is robust to dynamics-irrelevant noise.

Fourth, we propose a bias-correction theory and method for multi-goal exploration. Previous methods treat a hindsight goal extracted from failed experiences as the original goal so that rewards are received frequently. We theoretically analyze the hindsight bias in multi-goal RL caused by hindsight goals and formally describe the optimization objective and bias of multi-goal learning: using hindsight goals changes the trajectory probability, which introduces hindsight bias into the optimization objective of multi-goal RL. Based on how this hindsight bias arises and changes, we propose a bias-correction method based on importance sampling. The existing policy network is used to compute the hindsight bias without introducing additional computation modules, and we use causal inference theory to explain the method theoretically. We improve the stability of the bias correction with bias clipping and batch projection. Experimental results show that the bias-corrected multi-goal learning algorithm outperforms several baseline methods on complex manipulator grasping tasks without additional computational cost.

In summary, this paper studies the exploration efficiency of RL from the perspectives of uncertainty measurement, multimodality, robustness, and multi-goal learning. The proposed methods improve the exploration efficiency of agents in high-dimensional state spaces and sparse-reward settings, play important roles in solving exploration problems in multimodal, stochastic, and multi-goal environments, and improve the exploration efficiency of agents in complex scenes both in theory and in practice.
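As a point of reference for the variational dynamics model in the second contribution, a bound of the kind described there can be written in the standard conditional evidence-lower-bound form below; the exact parameterization used in this work may differ, so the expression should be read as a generic template rather than the precise objective.

$$
\log p_\theta(s_{t+1}\mid s_t,a_t)\;\ge\;
\mathbb{E}_{q_\phi(z\mid s_t,a_t,s_{t+1})}\big[\log p_\theta(s_{t+1}\mid s_t,a_t,z)\big]
-\mathrm{KL}\big(q_\phi(z\mid s_t,a_t,s_{t+1})\,\|\,p_\theta(z\mid s_t,a_t)\big)
$$

Here $q_\phi$ is the inference distribution over the latent variable $z$, $p_\theta(z\mid s_t,a_t)$ is the conditional prior, and the right-hand side is maximized with stochastic gradients and the reparameterization trick, as described above.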
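The contrastive objective in the third contribution estimates mutual information between representations of consecutive observations. A minimal InfoNCE-style sketch is given below; it assumes the encoders have already produced embedding matrices and uses in-batch negatives, which is one common choice and not necessarily the exact estimator used in this work.

```python
import numpy as np

def contrastive_mi_loss(z_curr: np.ndarray, z_next: np.ndarray,
                        temperature: float = 0.1) -> float:
    """InfoNCE-style lower-bound objective on I(z_curr; z_next).

    z_curr, z_next: arrays of shape (batch, dim); row i of z_next is the
    positive (true next-state embedding) for row i of z_curr, and the
    remaining rows in the batch serve as negatives.
    """
    # Cosine-similarity logits between every current/next pair in the batch.
    z_curr = z_curr / np.linalg.norm(z_curr, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = z_curr @ z_next.T / temperature              # (batch, batch)

    # Cross-entropy with the diagonal (true pairs) as the labels.
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss tightens a lower bound on the mutual information; an exploration bonus can then be built from the information gain measured in the learned representation space, in the spirit of the DB-bonus described above.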
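For the fourth contribution, the bias-correction step can be pictured as a standard importance-sampling reweighting of hindsight-relabelled trajectories, computed from the existing goal-conditioned policy and stabilized by clipping. The following sketch is illustrative only: the weight definition, clipping range, and function name are assumptions for the example, not the exact formulation of this work.

```python
import numpy as np

def hindsight_importance_weight(pi_orig_goal: np.ndarray,
                                pi_hindsight_goal: np.ndarray,
                                clip: float = 5.0) -> float:
    """Clipped importance weight for one hindsight-relabelled trajectory.

    pi_orig_goal:      action probabilities pi(a_t | s_t, g) along the trajectory
                       under the original goal g that actually generated it, shape (T,).
    pi_hindsight_goal: action probabilities pi(a_t | s_t, g') under the hindsight
                       goal g' used for relabelling, shape (T,).
    The product of per-step ratios corrects the change in trajectory probability
    introduced by relabelling; clipping keeps the correction numerically stable.
    """
    log_ratio = np.log(pi_hindsight_goal + 1e-8) - np.log(pi_orig_goal + 1e-8)
    weight = float(np.exp(log_ratio.sum()))
    return float(np.clip(weight, 1.0 / clip, clip))
```

Because the weight is computed directly from the policy network's action probabilities under the two goals, no additional computation module is required, which matches the design goal stated above.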