The robotic arm, as a class of robot systems oriented toward precise manipulation, has gradually been applied to more practical scenarios. Owing to the multi-joint, high-dimensional dynamics of the robotic arm and the complexity of its control objectives, obtaining optimal decision-making and control capability in a high-dimensional state-action space is a challenging problem for realizing autonomous manipulation. Model-based robot control methods require accurate modeling of the operating environment and the task, which makes adaptive optimization difficult for complex systems. Therefore, the study of machine learning methods that can use empirical data to autonomously optimize manipulation performance is of great significance for improving the intelligence and manipulation capability of robotic systems.

Reinforcement learning, as a class of machine learning methods for solving sequential decision-making and optimization problems, plays an important role in improving the autonomy of robotic systems. In particular, deep reinforcement learning, which is based on deep neural networks, has received extensive attention and development in recent years, providing an effective approach to end-to-end self-learning control. However, existing deep reinforcement learning algorithms still face challenges such as low sample efficiency, slow convergence, and unstable training when solving robot control problems, especially in high-dimensional state spaces. This dissertation studies feature pre-training methods and policy optimization methods in deep reinforcement learning to overcome these limitations in the autonomous robotic grasping control problem. The research is carried out from three aspects: efficient feature representation, rapid policy evaluation, and efficient Actor-Critic deep reinforcement learning. This dissertation analyzes the limitations of existing algorithms, proposes corresponding solutions, and conducts extensive simulation and experimental validation on the robot grasping control problem. The results show that the proposed methods are significantly better than existing algorithms in sample efficiency, learning stability, and convergence speed.

The specific research work and innovations of this dissertation are as follows:

(1) For feature pre-training in robot autonomous grasping tasks, a multi-task unsupervised feature representation method with a synthetic model (MURS) and a supervised feature pre-training method based on the hierarchical Extreme Learning Machine (SHELM) are proposed, which effectively improve the sample efficiency of end-to-end deep reinforcement learning under high-dimensional image input conditions. MURS uses a stacked feature embedding structure that combines a residual-network-based autoencoder with auxiliary tasks such as model prediction and reward prediction to extract task-relevant low-dimensional features from raw images; SHELM compresses high-dimensional image features into a low-dimensional state space by supervised learning, which is used to accelerate policy training. The proposed feature pre-training methods are tested in robot grasping control simulations in combination with typical deep reinforcement learning algorithms. The results show that they can effectively improve the sample efficiency of end-to-end robot learning control with pixel inputs and have important application value for scenarios with limited perception or difficulty in accurate state measurement.
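To make the multi-task pre-training objective concrete, the following is a minimal PyTorch-style sketch of a MURS-like feature extractor. The class name, layer sizes, the 3x64x64 input assumption, the plain (non-residual) convolutions, and the equal loss weights are illustrative assumptions, not the dissertation's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MURSEncoder(nn.Module):
    """Convolutional encoder with three pre-training heads: image
    reconstruction (autoencoder), latent dynamics (model prediction),
    and reward prediction."""
    def __init__(self, feat_dim=50, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(           # 3x64x64 image -> feat_dim vector
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 14 * 14, feat_dim))
        self.decoder = nn.Sequential(           # reconstruction head
            nn.Linear(feat_dim, 32 * 14 * 14), nn.ReLU(),
            nn.Unflatten(1, (32, 14, 14)),
            nn.ConvTranspose2d(32, 32, 4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2))
        self.dynamics = nn.Linear(feat_dim + action_dim, feat_dim)  # model-prediction head
        self.reward = nn.Linear(feat_dim + action_dim, 1)           # reward-prediction head

    def loss(self, obs, action, next_obs, reward):
        z = self.encoder(obs)
        with torch.no_grad():
            z_next = self.encoder(next_obs)     # target features, no gradient
        za = torch.cat([z, action], dim=-1)
        return (F.mse_loss(self.decoder(z), obs)
                + F.mse_loss(self.dynamics(za), z_next)
                + F.mse_loss(self.reward(za), reward))

After pre-training on transition tuples (obs, action, next_obs, reward), only the frozen encoder would be kept, and the reinforcement learning policy would be trained on its low-dimensional output instead of raw pixels.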
(2) Aiming at the local convergence of the Least-Squares Temporal-Difference (LSTD) algorithm and the low sample efficiency of the linear Temporal-Difference (TD) algorithm, a Least-Squares Truncated Temporal-Difference (LST²D) policy evaluation method is proposed. An adaptive truncation and switching mechanism is designed, which effectively combines the fast convergence of LSTD with the asymptotic convergence of linear TD. LSTD is used in the early stage of policy evaluation, approximating the value function with fast convergence; when LSTD shows local convergence or saturation, the method automatically switches to TD to continue approximating the state value function. The performance of the proposed algorithm is analyzed and discussed theoretically. In addition, LST²D is extended to the policy evaluation problem under image input conditions by combining it with the feature pre-training methods proposed in this dissertation. Robot grasping control simulations and experiments on a UR5 robot show that LST²D effectively improves the convergence speed and accuracy of policy evaluation under both state measurement and high-dimensional image input conditions, outperforming both LSTD and TD.
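A minimal NumPy sketch of the switching idea follows; the class name, the regularization constant, and the saturation test (a negligible change in the LSTD solution) are illustrative stand-ins for the dissertation's adaptive truncation criterion.

import numpy as np

class LST2D:
    """Linear value estimation V(s) = w . phi(s): run LSTD while it makes
    fast progress, then hand off to incremental TD(0) once it saturates."""
    def __init__(self, n_feat, gamma=0.99, alpha=1e-3, tol=1e-4):
        self.A = 1e-3 * np.eye(n_feat)   # regularized LSTD statistics
        self.b = np.zeros(n_feat)
        self.w = np.zeros(n_feat)
        self.gamma, self.alpha, self.tol = gamma, alpha, tol
        self.use_td = False              # False: LSTD phase, True: TD phase

    def update(self, phi, reward, phi_next):
        if not self.use_td:
            self.A += np.outer(phi, phi - self.gamma * phi_next)
            self.b += reward * phi
            w_new = np.linalg.solve(self.A, self.b)
            if np.linalg.norm(w_new - self.w) < self.tol:
                self.use_td = True       # LSTD saturated: truncate and switch
            self.w = w_new
        else:
            td_error = reward + self.gamma * (phi_next @ self.w) - phi @ self.w
            self.w += self.alpha * td_error * phi

The early LSTD phase extracts as much as possible from few samples, while the later TD phase keeps the per-step cost low and retains the asymptotic convergence of linear TD.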
(3) An Actor-Critic deep reinforcement learning method based on pre-trained features and LST²D (ACPFL) is proposed to improve the learning efficiency and stability of deep reinforcement learning. Unlike existing Actor-Critic deep reinforcement learning methods, the Critic of ACPFL adopts the LST²D algorithm with a linear value function approximation structure, which improves the convergence speed and accuracy of value function approximation. The convergence of ACPFL is analyzed theoretically using the two-timescale Actor-Critic framework. On this basis, ACPFL is combined with the feature pre-training methods and extended to an efficient deep reinforcement learning method under image observation conditions (ACPFL-MURS), and to a deep reinforcement learning method based on sparse kernel feature representation under state measurement conditions (ACPFL-Kernel). Simulation results on the robot autonomous grasping control task show that the proposed algorithm outperforms discrete-action deep reinforcement learning algorithms such as PPO and DQN in learning efficiency and stability.

(4) A soft Actor-Critic deep reinforcement learning algorithm based on sparse kernel representation and LST²D (KTSAC) is proposed, which effectively improves the exploration efficiency and stability of Actor-Critic deep reinforcement learning in high-dimensional continuous state and action spaces. First, KTSAC automatically constructs a continuous state-action feature representation by kernel-based secondary sampling of the sample space; on this basis, KTSAC combines the LST²D algorithm with a soft Bellman residual learning mechanism under linear value function approximation, so that the policy evaluation in the Critic achieves both fast convergence and asymptotic stability. The convergence of soft Bellman residual learning under linear value function approximation is analyzed and proved theoretically. Grasping simulations on the V-REP platform and experiments on a real UR5 robot show that KTSAC enables the robot to learn a grasping control policy autonomously without relying on a model, with learning efficiency better than that of continuous-action deep reinforcement learning algorithms such as SAC and TD3.
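A brief sketch of the Critic's building blocks follows: Gaussian kernel features over a dictionary of state-action centers, and one update step on the soft Bellman residual under a linear Q-function. The function names, the Gaussian kernel choice, the fixed dictionary, and the single-sample residual-gradient form are illustrative assumptions; KTSAC's actual dictionary construction and its combination with LST²D are more involved.

import numpy as np

def kernel_features(sa, centers, bandwidth=1.0):
    """Gaussian kernel features of a state-action vector sa (shape (d,))
    with respect to a dictionary of centers (shape (n_centers, d))."""
    diff = centers - sa
    return np.exp(-np.sum(diff * diff, axis=1) / (2.0 * bandwidth ** 2))

def soft_bellman_residual_step(w, phi, reward, phi_next, logp_next,
                               gamma=0.99, temp=0.2, lr=1e-2):
    """One residual-gradient step on the soft Bellman error for a linear
    Q-function Q(s, a) = w . phi(s, a); the entropy term -temp * logp_next
    comes from the soft (maximum-entropy) Bellman backup."""
    target = reward + gamma * (phi_next @ w - temp * logp_next)
    residual = phi @ w - target
    # gradient of 0.5 * residual**2 w.r.t. w, through prediction and target
    return w - lr * residual * (phi - gamma * phi_next)

In the full algorithm, the dictionary of centers would be grown online so that the kernel representation stays sparse, and the plain gradient step above would be replaced by the LST²D update sketched under item (2).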