
TD3 And Physics Simulation Reasoning Based Open Task Solving

Posted on: 2021-10-31    Degree: Master    Type: Thesis
Country: China    Candidate: S T Chai    Full Text: PDF
GTID: 2518306569496614    Subject: Software engineering
Abstract/Summary:
This thesis studies how to help an agent use a policy learnt while exploring an environment to solve open tasks, and how to generalize that policy to new environments. Open tasks typically have sparse rewards and a huge exploration space, so a better exploration strategy is needed to speed up the agent's learning. To ensure that the learnt knowledge is not limited to a specific task, and to help the agent obtain the sparse reward, intrinsic rewards are introduced to encourage exploration. A task requiring the end effector of a manipulator to reach a specific body is designed with the robot learning framework Pyrobolearn, and new reinforcement learning algorithms are proposed to train the agent in a task-agnostic environment without any task-related reward, such that the learnt policy generalizes to the proposed task.

Building on the TD3 and HER algorithms, a locality sensitive hashing (LSH) based counting reward is introduced to encourage exploration of the unknown environment: the LSH method discretizes the state space and counts the number of visits to each discretized state, and frequently visited states yield a smaller reward, steering the agent towards less visited ones.

Building on HER, a trajectory replacement strategy is proposed. It replaces every failed trajectory with one that is more likely to succeed, and stores both the replaced trajectory and the original in the replay buffer for training. Trajectory replacement does not require the state vector to consist of achieved and desired goals, so it adapts to broader cases and more complex tasks.

A mixed Gaussian noise layer is proposed to provide adaptive policy noise. Mixed Gaussian noise is obtained by using a fully connected layer to rescale a Gaussian noise vector; it converts the deterministic action produced by the actor network into a randomly exploring action, and the fully connected layer weights, which decide the quality of exploration, are learnt automatically by back-propagation. Experimental results indicate that the agent with mixed Gaussian noise converges to a higher environmental reward and a higher success rate.

Finally, a reward based on physics simulation time is proposed. Solving open tasks often involves complicated environment dynamics that cost more time to simulate, so the simulation time itself can serve as a reward that improves the agent's exploration strategy and generalizability. The simulation-time reward encourages the agent to learn a generalizable policy in a task-agnostic environment and then adapt its policy once given a task setting with sparse rewards. Experimental results indicate that the simulation-time reward is correlated with the task-specific reward and can be used to improve the exploration strategy. Minimal code sketches of the counting reward, the trajectory replacement, the noise layer, and the simulation-time reward follow below.
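The abstract does not give implementation details of the LSH counting reward, so the following is a minimal sketch assuming a SimHash-style scheme: a fixed random projection hashes continuous states into binary codes, visit counts are kept per code, and the intrinsic bonus decays with the count. The class name, the hash_bits and beta parameters, and the 1/sqrt(count) bonus form are illustrative assumptions, not necessarily the thesis's exact design.

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Count-based exploration bonus via locality sensitive hashing.

    A sketch: states are projected with a fixed Gaussian matrix, the
    sign pattern serves as the hash bucket, and the intrinsic reward
    shrinks as the bucket's visit count grows.
    """
    def __init__(self, state_dim, hash_bits=16, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((hash_bits, state_dim))  # random projection
        self.beta = beta
        self.counts = defaultdict(int)

    def intrinsic_reward(self, state):
        # Discretize the continuous state into a binary hash code.
        code = tuple((self.A @ np.asarray(state) > 0).astype(np.int8))
        self.counts[code] += 1
        # Rarely visited buckets yield a larger bonus: beta / sqrt(n).
        return self.beta / np.sqrt(self.counts[code])
```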
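The replacement rule itself is not spelled out in the abstract. The sketch below therefore shows the classic HER "final" relabeling as a stand-in: a failed trajectory is rewritten as if the goal actually reached at its end had been the desired goal, yielding a trajectory that succeeds, and both versions go into the replay buffer. Note that the thesis's variant explicitly avoids the achieved-goal/desired-goal structure assumed here; this sketch only illustrates the general idea. All names and the transition format are assumptions.

```python
def her_final_relabel(trajectory, reward_fn):
    """HER-style stand-in for trajectory replacement (sketch).

    `trajectory` is assumed to be a list of transition dicts with keys
    'state', 'action', 'achieved_goal', 'desired_goal', 'next_state'.
    `reward_fn(achieved_goal, desired_goal)` recomputes the reward.
    """
    new_goal = trajectory[-1]['achieved_goal']  # goal the agent actually reached
    replaced = []
    for t in trajectory:
        t2 = dict(t)
        t2['desired_goal'] = new_goal
        t2['reward'] = reward_fn(t['achieved_goal'], new_goal)
        replaced.append(t2)
    return replaced

# Both the original and the replaced trajectory are stored for training:
# replay_buffer.extend(trajectory)
# replay_buffer.extend(her_final_relabel(trajectory, reward_fn))
```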
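A minimal PyTorch sketch of the mixed Gaussian noise layer as described: a fully connected layer rescales a standard normal vector, and the result is added to the actor's deterministic action, so the mixing weights are trained by back-propagation together with the actor. The layer sizes and the additive combination are assumptions where the abstract is silent.

```python
import torch
import torch.nn as nn

class MixedGaussianNoise(nn.Module):
    """Mixed Gaussian noise layer (sketch).

    A linear layer mixes an i.i.d. standard normal vector into an
    action-sized noise term; adding it to the deterministic action
    yields a randomly exploring action whose noise scale and
    correlation adapt during training.
    """
    def __init__(self, noise_dim, action_dim):
        super().__init__()
        self.mix = nn.Linear(noise_dim, action_dim, bias=False)
        self.noise_dim = noise_dim

    def forward(self, deterministic_action):
        eps = torch.randn(deterministic_action.shape[0], self.noise_dim,
                          device=deterministic_action.device)
        return deterministic_action + self.mix(eps)  # exploring action
```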
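The simulation-time reward can be illustrated with a generic environment wrapper that times the physics step and returns the elapsed wall-clock time as the reward, following the thesis's premise that dynamically complex interactions take longer to simulate. The wrapper name, the gym-style step signature, and the scale factor are assumptions; the thesis builds on Pyrobolearn, whose API is not reproduced here.

```python
import time

class SimTimeRewardWrapper:
    """Simulation-time intrinsic reward (sketch).

    Wraps any environment exposing step(action) -> (obs, reward, done,
    info) and replaces the reward with the wall-clock time consumed by
    the physics step, scaled by an illustrative factor.
    """
    def __init__(self, env, scale=1.0):
        self.env = env
        self.scale = scale

    def step(self, action):
        start = time.perf_counter()
        obs, _reward, done, info = self.env.step(action)
        sim_time = time.perf_counter() - start  # proxy for dynamic complexity
        return obs, self.scale * sim_time, done, info
```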
Keywords/Search Tags:Meta Learning, Reinforcement Learning, Robotics, Physics Simulation, Sparse Reward