With the rapid development of intelligent technology, robots are increasingly being applied in production and daily life, and robot manipulation skill learning has become a research hotspot. Faced with increasingly complex tasks, traditional machine learning is no longer sufficient for robot manipulation skill learning. Deep reinforcement learning is applied to solve this problem: the robot interacts with the environment through a control policy network to achieve autonomous learning and autonomous decision making of manipulation skills. In reinforcement learning under sparse-reward environments, however, it is difficult for the agent to obtain samples with positive rewards, which leads to problems such as slow convergence of network model training. This thesis improves existing deep reinforcement learning algorithms in sparse-reward environments, aiming to strengthen the robot's decision-making ability on complex tasks. The main research of this thesis is as follows.

First, a dual-policy mutual learning and Director network algorithm based on knowledge distillation (Knowledge Distillation Mutual Learning Director network, KDMLDire) is proposed to address the slow convergence and value overestimation of network models in deep reinforcement learning. During exploration, a dual-policy network improves the quality of data samples by comparing the actions fitted by each policy and selecting the one with the higher estimated value; during training, the two policy networks accelerate each other's convergence through knowledge distillation; finally, a Director network is added to classify the samples in the experience replay buffer according to their reward values, and this Director network is trained with a supervised learning approach, thereby alleviating the overestimation problem in reinforcement learning.

Second, a Meta Generative Intrinsic Reward (MGIR) algorithm based on hierarchical meta-learning is proposed for the problem that deep reinforcement learning remains inefficient on complex tasks under sparse rewards. First, hierarchical learning is combined with meta-learning on complex tasks, so that knowledge learned on simple tasks can be reused to adapt quickly to new tasks; then, a Generative Intrinsic Reward module is introduced into the actor-critic algorithm, combining the standard rewards from reinforcement learning with generated intrinsic rewards to encourage the agent to explore novel states that have not been visited; finally, these data are used for training to improve the stability of the algorithm.

Finally, the KDMLDire and MGIR algorithms are applied to robot manipulation skill learning on the Fetch tasks of the MuJoCo simulation environment. Under the sparse-reward setting, the proposed algorithms are compared experimentally with existing off-policy reinforcement learning algorithms, and the results show that the proposed algorithms outperform the other algorithms in both training efficiency and success rate.
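
The abstract does not give implementation details for KDMLDire; the following is a minimal sketch of the dual-policy exploration and mutual-distillation steps it describes, assuming two independent actor networks and a shared critic. All names (actor_a, actor_b, critic, distill_weight) are hypothetical, not taken from the thesis.

    import torch
    import torch.nn.functional as F

    def select_action(actor_a, actor_b, critic, state):
        """Dual-policy exploration step (sketch): each actor proposes an
        action for a single unbatched state, and the shared critic's value
        estimate picks the better of the two."""
        with torch.no_grad():
            action_a = actor_a(state)
            action_b = actor_b(state)
            q_a = critic(state, action_a)
            q_b = critic(state, action_b)
        return action_a if q_a.item() >= q_b.item() else action_b

    def mutual_distillation_loss(actor_a, actor_b, states, distill_weight=0.1):
        """Mutual knowledge-distillation term (sketch): each policy is
        pulled toward the other's output so the two networks accelerate
        each other's convergence."""
        actions_a = actor_a(states)
        actions_b = actor_b(states)
        # Stop gradients through the "teacher" side of each direction.
        loss_a = F.mse_loss(actions_a, actions_b.detach())
        loss_b = F.mse_loss(actions_b, actions_a.detach())
        return distill_weight * (loss_a + loss_b)

In a training loop, this loss would be added to each actor's usual policy-gradient objective; the mixing coefficient distill_weight is an assumed hyperparameter.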
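The Director network is described only as a supervised classifier over replay samples grouped by reward value. A plausible reading, sketched here under the assumption of binary labels (positive reward vs. not), is an ordinary feed-forward classifier over (observation, action) pairs:

    import torch
    import torch.nn as nn

    class Director(nn.Module):
        """Sketch of a Director network: a binary classifier over
        transitions, trained with supervised learning to separate
        positive-reward samples from the rest of the replay buffer."""
        def __init__(self, obs_dim, act_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1))

    def director_loss(director, obs, act, rewards):
        """Label each replay sample by whether its reward is positive and
        train the Director as a standard supervised classifier."""
        labels = (rewards > 0).float()
        logits = director(obs, act).squeeze(-1)
        return nn.functional.binary_cross_entropy_with_logits(logits, labels)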
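The abstract likewise does not specify how MGIR's intrinsic rewards are generated. One common pattern, shown here purely as an assumption rather than the thesis's exact design, is to use a learned forward model's prediction error as a novelty signal and mix it with the sparse extrinsic reward:

    import torch
    import torch.nn as nn

    class ForwardModel(nn.Module):
        """Hypothetical generative module: predicts the next state from
        the current state and action; its prediction error serves as an
        intrinsic-reward novelty signal."""
        def __init__(self, obs_dim, act_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, obs_dim),
            )

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1))

    def intrinsic_reward(model, obs, act, next_obs):
        """Larger prediction error means a more novel state and hence a
        larger intrinsic reward."""
        with torch.no_grad():
            pred = model(obs, act)
        return (pred - next_obs).pow(2).mean(dim=-1)

    def total_reward(extrinsic, intrinsic, beta=0.5):
        """Mix the sparse extrinsic reward with the generated intrinsic
        bonus; beta is a hypothetical mixing coefficient."""
        return extrinsic + beta * intrinsic

The mixed total_reward would replace the raw environment reward when training the actor-critic networks, which matches the abstract's description of combining standard reward data with generated intrinsic rewards.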