Reinforcement learning (RL) focuses on goal-oriented learning from interaction, in which an agent learns from its interactions with the environment to achieve goals or maximize cumulative reward. With the help of deep learning, deep RL has been widely applied to continuous control tasks that are common in practice and has achieved impressive performance with a single goal and carefully designed rewards. In reality, however, continuous control tasks often involve multiple goals. Typical RL algorithms rely on reward functions and cannot directly apply a policy learned for one goal to another, so the multi-goal nature significantly limits their application to continuous control tasks. In this case, multi-goal RL, which uses goal-conditioned policies and goal-achieving rewards, has become a feasible solution. Multi-goal RL aims to achieve and generalize over a range of different goals, and it can be combined with hierarchical reinforcement learning, planning, and other methods to achieve stronger decision-making capabilities. However, applying multi-goal RL to continuous control tasks still faces difficult problems, especially sparse rewards. How to achieve the desired goal as reliably as possible in multi-goal continuous control tasks with sparse rewards is an urgent problem for multi-goal RL.

We focus on the challenge of multi-goal exploration and exploitation across different tasks, and propose to plan and achieve curriculum-guided subgoals. A series of multi-goal RL algorithms is studied within a multi-goal exploration-replay framework, achieving effective exploration and policy learning in a variety of tasks, from short horizons to long horizons and from online to offline settings, and improving the sample efficiency of multi-goal continuous control with sparse rewards. The main contributions are summarized as follows:

· We propose a hindsight multi-goal experience replay method against unreliable value estimation. Hindsight experience replay is the key for multi-goal agents to learn from sparse rewards, but it inevitably shifts the distribution of goals in the replayed experience, which introduces extrapolation errors and hindsight bias into multi-goal value estimation. Based on an analysis of these errors and this bias, we propose two countermeasures. For the estimation errors of pseudo-goals, we construct an experience sampling sequence that shifts the goal distribution from achieved goals to desired goals, realizing curriculum-guided multi-goal experience replay and reducing the off-policyness between the replayed experience and the current policy. For hindsight bias, we propose a surrogate learning objective on the relabelled experience and optimize it with a hindsight reward that varies with the distribution divergence. (A minimal sketch of the curriculum-guided relabelling is given after this list.)

· We propose a multi-goal exploration method via exploring successor matching. When the initial state of the agent is far from the desired goal, the goal-achieving signal attenuates exponentially, and it is difficult for the agent to obtain signals that guide learning, even with hindsight experience replay. To address this issue, we guide the agent to reach the desired position gradually by setting intermediate subgoals as milestones and reaching them in sequence. Specifically, we adopt the idea of successor features and propose a milestone-matching method based on them: the subgoal that is more likely to be achieved, as judged by successor-feature matching, is regarded as a milestone.
Compared with general state features, successor features, which indicate future state occupancy, can be reused across different goals and allow stable training under sparse rewards. The method embeds representation learning into a multi-goal exploration-replay framework, proposes successor matching, and gradually encourages multi-goal exploration toward promising milestone frontiers. (A sketch of milestone selection by successor-feature matching follows this list.)

· We propose a curriculum goal-conditioned self-imitation method on a fixed dataset. In offline scenarios with no exploration and scarce rewards, it is difficult for the agent to learn from the sub-optimal trajectories in offline datasets or experience buffers. In response to this problem, we adopt the idea of learning from potentially high-return stitchings of sub-trajectories: the agent splits existing trajectories and stitches the resulting sub-trajectories into potential high-reward trajectories. By self-imitating the generated curriculum sub-trajectories, the agent guides sub-trajectory stitching toward an ideal future state, which enables it to learn optimal policies from sub-optimal data. This method not only avoids the severe off-policy distribution shift problem in offline learning but also provides demonstrations for multi-goal agents by augmenting and filtering fixed experiences with universal goals, including future states and accumulated rewards. (A sketch of the sub-trajectory splitting and filtering appears after this list.)
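The curriculum-guided relabelling of the first contribution can be illustrated with the minimal sketch below. It is not the thesis's exact algorithm: the class name CurriculumHindsightBuffer, the progress-based relabelling schedule, and the down-weighted reward for relabelled goals are assumptions standing in for the curriculum sampling sequence and the divergence-dependent hindsight reward.

import numpy as np

# Minimal sketch of curriculum-guided hindsight relabelling (first contribution).
# The class name, the `progress`-based schedule, and the down-weighted reward for
# relabelled goals are illustrative assumptions, not the exact thesis algorithm.
class CurriculumHindsightBuffer:
    """Replay buffer whose relabelled-goal distribution shifts from achieved
    goals toward the original desired goals as training progresses."""

    def __init__(self, capacity, goal_tolerance=0.05):
        self.capacity = capacity
        self.goal_tolerance = goal_tolerance
        # each transition: (state, action, next_state, achieved_goal, desired_goal)
        self.storage = []

    def add(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append(transition)

    def sample(self, batch_size, progress):
        """`progress` in [0, 1]: early training mostly replays pseudo-goals drawn
        from the achieved-goal distribution (dense signal); later training keeps
        the original desired goals (small off-policyness)."""
        batch = []
        for i in np.random.randint(0, len(self.storage), size=batch_size):
            state, action, next_state, achieved_goal, desired_goal = self.storage[i]
            relabelled = np.random.rand() > progress
            if relabelled:
                # pseudo-goal sampled from the achieved goals stored in the buffer
                goal = self.storage[np.random.randint(len(self.storage))][3]
            else:
                goal = desired_goal
            achieved = np.linalg.norm(achieved_goal - goal) < self.goal_tolerance
            # down-weight rewards of relabelled goals as a crude stand-in for a
            # divergence-dependent hindsight reward (assumed functional form)
            weight = max(progress, 0.1) if relabelled else 1.0
            batch.append((state, action, next_state, goal, weight * float(achieved)))
        return batch

As training advances, the sampled goals increasingly coincide with the desired goals, which is one simple way to realize the distribution shift from achieved goals to desired goals described above.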
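Milestone selection by successor-feature matching, from the second contribution, can be sketched as follows. Here successor_features(state, subgoal) and goal_features(goal) are assumed, learned mappings (hypothetical names), and the dot-product score is an illustrative choice rather than the thesis's exact formulation.

import numpy as np

# Minimal sketch of milestone selection by successor-feature matching
# (second contribution). The callables and the dot-product score are assumptions.
def select_milestone(current_state, desired_goal, candidate_subgoals,
                     successor_features, goal_features):
    """Return the candidate subgoal whose predicted future state occupancy
    (successor features under the goal-conditioned policy) best matches the
    features of the desired goal."""
    phi_desired = goal_features(desired_goal)
    best_subgoal, best_score = None, -np.inf
    for subgoal in candidate_subgoals:
        # psi estimates the discounted future feature occupancy when pursuing `subgoal`
        psi = successor_features(current_state, subgoal)
        score = float(np.dot(psi, phi_desired))
        if score > best_score:
            best_subgoal, best_score = subgoal, score
    return best_subgoal

Because the successor features summarize future state occupancy rather than a single state, the same representation can be reused to score candidate milestones for different desired goals.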
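The sub-trajectory splitting and filtering behind the third contribution can be sketched roughly as below; segment_len, return_threshold, and using the segment's final achieved state as the hindsight goal are illustrative assumptions, not the method's exact construction.

# Rough sketch of sub-trajectory splitting, hindsight relabelling, and return
# filtering for goal-conditioned self-imitation (third contribution).
def build_self_imitation_dataset(trajectories, segment_len, return_threshold):
    """`trajectories`: list of episodes, each a list of
    (state, action, reward, achieved_goal) tuples from a fixed offline dataset."""
    dataset = []
    for traj in trajectories:
        for start in range(0, len(traj) - segment_len + 1, segment_len):
            segment = traj[start:start + segment_len]
            if sum(step[2] for step in segment) < return_threshold:
                continue                        # keep only promising segments
            hindsight_goal = segment[-1][3]     # final achieved state as the goal
            for state, action, _, _ in segment:
                dataset.append((state, hindsight_goal, action))
    return dataset

# The goal-conditioned policy is then trained by supervised self-imitation,
# e.g. maximizing log pi(action | state, hindsight_goal) over this dataset,
# which avoids querying out-of-distribution actions in the offline setting.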