Reinforcement learning is an important family of model-free learning algorithms, well suited to complex optimization and decision-making problems that lack prior knowledge or are highly dynamic. Traditional reinforcement learning research, however, is based on the Markov decision process and mainly targets finite discrete spaces: information about states and actions is stored in a table, and the value function is computed and updated by table lookup. For continuous tasks, such tabular algorithms struggle to achieve good learning results even after discretization. Approximate reinforcement learning addresses such problems through function approximation: it approximates the value function or the policy function and then obtains the optimal behavior policy by learning the parameters of the approximator. In practice, however, most of these algorithms suffer from low learning efficiency, slow convergence, and weak adaptability. This dissertation therefore builds on existing algorithms to address these problems. The main contributions are as follows.

(1) To solve the problem that the direct gradient method fails to converge under some conditions, a neural-network Q-learning algorithm based on the residual gradient method is proposed. During training, the objective function is defined by the Bellman residual, and the weights are updated by gradient descent to ensure convergence. In online reinforcement learning, samples are normally discarded immediately after use, which wastes experience; the proposed algorithm therefore stores samples in an experience buffer and draws random batches from it for each gradient update, which accelerates learning. A momentum correction further stabilizes the learning process. Simulation results show that the algorithm achieves stable control performance after little training, and that its learning speed and success rate surpass those of the standard algorithm.

(2) To address the low sample utilization and learning efficiency of single-step reinforcement learning with nonlinear function approximation, a multi-step Sarsa control algorithm based on an RBF neural network is proposed. The RBF network approximates the action-value function, learning quickly while avoiding local minima. Eligibility traces record the visited states to enable multi-step updates, so samples are used more efficiently and the convergence rate improves accordingly. Probability weights are then assigned to the action values to optimize the action-selection strategy and further improve learning efficiency. Simulation results show that the proposed algorithm combines nonlinear function approximation with multi-step reinforcement learning effectively, learning quickly and stably on continuous nonlinear control tasks.

(3) Typical value-function approximation algorithms apply only to tasks with discrete actions, whereas most control tasks have continuous action spaces. To handle this, a deep deterministic policy gradient algorithm based on prioritized experience replay is proposed. Neural networks approximate the value function and the policy function simultaneously, and the policy network directly outputs a deterministic action in each state rather than a probability distribution over actions, so continuous-action tasks can be solved. To reuse samples effectively, stored samples are ranked by importance: the more important a sample, the higher its sampling frequency, which benefits the learning process. Experimental results show that the proposed algorithm learns quickly and stably, and can control a manipulator to grasp random dynamic targets.

(4) To overcome the strong task specificity of algorithms designed for particular tasks and to broaden their application scenarios, a sample-transfer algorithm based on importance weights is proposed. It transfers knowledge from an old task to a new one by measuring and minimizing the distribution discrepancy between source-task and target-task samples. During transfer, a weight coefficient is assigned to each sample according to its contribution, which improves learning efficiency. Experimental results show that the proposed algorithm achieves high classification accuracy on several cross-domain tasks and outperforms many related algorithms overall; sensitivity experiments further confirm that it is robust to different parameter settings and obtains good classification results with fewer neurons.

In summary, the dissertation studies value-function approximation for continuous state spaces, policy gradient methods for continuous action spaces, and reinforcement transfer between similar tasks. Experiments on several continuous control tasks demonstrate that the proposed algorithms effectively improve sample efficiency, accelerate convergence, and enhance adaptability.
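The residual-gradient update with experience replay and momentum described in contribution (1) can be sketched as follows. This is a minimal illustration only: it uses linear function approximation in place of the dissertation's neural network, and the class name, hyperparameter values, and feature conventions are assumptions of this sketch.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99  # discount factor (illustrative value)
ALPHA = 0.05  # learning rate
BETA = 0.9    # momentum coefficient


class ResidualGradientQ:
    """Linear Q(s, a) = w . phi(s, a) trained with the residual gradient."""

    def __init__(self, n_features, buffer_size=1000, batch_size=8, seed=0):
        self.w = np.zeros(n_features)
        self.v = np.zeros(n_features)            # momentum accumulator
        self.buffer = deque(maxlen=buffer_size)  # experience cache
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def q(self, phi):
        return float(self.w @ phi)

    def store(self, phi, reward, phi_next):
        # phi_next: features of the greedy next state-action
        # (None for terminal transitions).
        self.buffer.append((phi, reward, phi_next))

    def update(self):
        """One residual-gradient step on a random minibatch."""
        batch = self.rng.sample(list(self.buffer),
                                min(self.batch_size, len(self.buffer)))
        grad = np.zeros_like(self.w)
        for phi, r, phi_next in batch:
            if phi_next is None:                  # terminal transition
                delta = r - self.q(phi)
                d_delta = -phi
            else:
                delta = r + GAMMA * self.q(phi_next) - self.q(phi)
                # The residual gradient differentiates BOTH Q(s', a') and
                # Q(s, a), unlike the possibly divergent direct gradient.
                d_delta = GAMMA * phi_next - phi
            grad += delta * d_delta               # gradient of 0.5 * delta^2
        grad /= len(batch)
        self.v = BETA * self.v + grad             # momentum correction
        self.w -= ALPHA * self.v
```

The key difference from the direct (semi-)gradient method is that the bootstrap target r + γQ(s′, a′) is not treated as a constant: differentiating through the next-state value is what makes each step a true descent step on the Bellman residual.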
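Contribution (2)'s combination of RBF value-function approximation with multi-step (eligibility-trace) Sarsa can be illustrated roughly as below. The Gaussian centers and width, the accumulating-trace style, and the softmax action selection are assumptions of this sketch, not the dissertation's exact design.

```python
import numpy as np

GAMMA = 0.95   # discount factor (illustrative value)
LAMBDA = 0.8   # trace-decay parameter
ALPHA = 0.1    # learning rate


class RBFSarsaLambda:
    """Sarsa(lambda) with Gaussian RBF features and accumulating traces."""

    def __init__(self, centers, width, n_actions):
        self.centers = np.asarray(centers, dtype=float)  # (K, state_dim)
        self.width = width
        self.n_actions = n_actions
        self.w = np.zeros(len(self.centers) * n_actions)
        self.z = np.zeros_like(self.w)                   # eligibility trace

    def features(self, state, action):
        """Gaussian RBF activations placed in the slot of `action`."""
        s = np.asarray(state, dtype=float)
        act = np.exp(-np.sum((self.centers - s) ** 2, axis=1)
                     / (2.0 * self.width ** 2))
        phi = np.zeros_like(self.w)
        k = len(self.centers)
        phi[action * k:(action + 1) * k] = act
        return phi

    def q(self, state, action):
        return float(self.w @ self.features(state, action))

    def begin_episode(self):
        self.z[:] = 0.0

    def step(self, state, action, reward, next_state, next_action, done):
        """One eligibility-trace Sarsa update (multi-step credit assignment)."""
        phi = self.features(state, action)
        target = reward if done else (
            reward + GAMMA * self.q(next_state, next_action))
        delta = target - self.w @ phi
        self.z = GAMMA * LAMBDA * self.z + phi   # accumulating trace
        self.w += ALPHA * delta * self.z         # credits all visited states

    def softmax_action(self, state, rng, tau=1.0):
        """Probability-weighted action selection over the action values."""
        qs = np.array([self.q(state, a) for a in range(self.n_actions)])
        p = np.exp((qs - qs.max()) / tau)
        p /= p.sum()
        return int(rng.choice(self.n_actions, p=p))
```

The trace vector `z` is what turns single-step Sarsa into a multi-step method: every state-action pair visited during the episode retains decaying eligibility, so one TD error updates the whole recent trajectory at once.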
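The prioritized replay mechanism in contribution (3), in which more important samples are replayed more often, is commonly implemented with TD-error-based priorities. The following sketch shows the proportional variant with a flat array (a sum-tree would be used at scale); the dissertation's exact priority measure and data structure are not reproduced here.

```python
import numpy as np


class PrioritizedReplay:
    """Proportional prioritized experience replay (illustrative sketch)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6, seed=0):
        self.capacity = capacity
        self.alpha = alpha   # how strongly priority shapes sampling
        self.beta = beta     # importance-sampling correction strength
        self.eps = eps       # keeps zero-error transitions sampleable
        self.data = []
        self.priorities = []
        self.pos = 0
        self.rng = np.random.default_rng(seed)

    def add(self, transition, td_error):
        p = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:                                   # overwrite oldest slot
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()
        idx = self.rng.choice(len(self.data), size=batch_size, p=probs)
        # Importance-sampling weights correct the bias introduced by
        # non-uniform sampling; normalizing by the max keeps them <= 1.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        batch = [self.data[i] for i in idx]
        return idx, batch, weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha
```

In a full DDPG loop, the critic's TD errors on each sampled batch would be fed back through `update_priorities`, so a transition's replay frequency tracks how surprising it currently is to the learner.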
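Contribution (4)'s importance weights quantify how representative each source-task sample is of the target distribution. One standard way to estimate such weights, shown here as an assumed stand-in for the dissertation's specific discrepancy measure, is a logistic discriminator whose output yields the density ratio p_target(x) / p_source(x).

```python
import numpy as np


def importance_weights(source_x, target_x, lr=0.1, iters=500):
    """Estimate w(x) ~ p_target(x) / p_source(x) for each source sample
    by training a logistic discriminator to tell target from source data."""
    X = np.concatenate([source_x, target_x])          # (N, d)
    y = np.concatenate([np.zeros(len(source_x)),      # label 0 = source
                        np.ones(len(target_x))])      # label 1 = target
    Xb = np.hstack([X, np.ones((len(X), 1))])         # append bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):                            # batch gradient descent
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))         # P(target | x)
        theta -= lr * Xb.T @ (p - y) / len(y)
    # With equal class priors (same sample counts), the density ratio is
    # p_target(x) / p_source(x) = P(target | x) / (1 - P(target | x)).
    Sb = np.hstack([source_x, np.ones((len(source_x), 1))])
    p_src = 1.0 / (1.0 + np.exp(-Sb @ theta))
    return p_src / (1.0 - p_src)
```

Source samples that look like target data receive weights above 1 and contribute more to training on the new task, while samples from regions the target task never visits are down-weighted, which is the intuition behind minimizing the distribution discrepancy during transfer.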