As one of the important learning paradigms in machine learning research, reinforcement learning (RL) is agent-oriented learning. In RL tasks, the agent interacts with the environment to learn the mapping between states and actions and thereby achieve the goal of the task. The RL process seeks the optimal policy on the basis of these interactions and of the learning algorithm. RL methods for finding the optimal policy can be divided into value function based methods and policy gradient methods. Value function based algorithms do not need an explicit expression of the agent's action policy; instead, they obtain the optimal value function by continuously updating the expected cumulative reward of each state-action pair. Value function based methods have great difficulty solving stochastic policy problems; for example, when the curse of dimensionality arises, a linear function approximation model cannot guarantee the convergence of the algorithm.

In contrast, since a policy gradient method represents the policy as an explicit function, the parameter vector of the policy function can be optimized step by step along the direction of the policy gradient until it approaches the optimal solution of the task. Training with policy gradient methods has good convergence properties and has therefore become a research hotspot. Another merit of policy gradient methods is that actions can be generated in several probabilistic representation forms. Their drawback, however, is a slow convergence rate. Combining the advantages of these two kinds of algorithms can therefore effectively improve the convergence performance and stability of the resulting algorithm. The starting point of this paper is to use the actor-critic network as the framework and to train it with policy gradient algorithms; considering the low convergence rate and weak stability of the training process, a fast and stable policy gradient algorithm based on an adaptive learning rate design is proposed.
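To make the actor-critic combination concrete, the sketch below shows a minimal actor-critic policy gradient update in which a value-based critic supplies the temporal-difference error that scales the actor's gradient step. The linear critic, the softmax actor for a discrete-action case, and all names (`ActorCritic`, `alpha_actor`, `alpha_critic`) are illustrative assumptions, not the exact formulation developed in this paper.

```python
import numpy as np

class ActorCritic:
    """Minimal actor-critic sketch: softmax actor, linear critic (illustrative only)."""

    def __init__(self, n_features, n_actions, alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
        self.theta = np.zeros((n_actions, n_features))  # actor (policy) parameters
        self.w = np.zeros(n_features)                   # critic (value) parameters
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma

    def policy(self, phi):
        # Softmax action probabilities for the feature vector phi
        prefs = self.theta @ phi
        prefs -= prefs.max()                            # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def act(self, phi):
        return np.random.choice(len(self.theta), p=self.policy(phi))

    def update(self, phi, a, reward, phi_next, done):
        # Critic: TD(0) error under the discounted model
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        delta = reward + self.gamma * v_next - v
        self.w += self.alpha_critic * delta * phi

        # Actor: policy gradient step scaled by the TD error
        p = self.policy(phi)
        grad_log = -np.outer(p, phi)                    # d log pi(a|s) / d theta, all rows
        grad_log[a] += phi
        self.theta += self.alpha_actor * delta * grad_log
```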
Based on the above analysis, the main research and contributions of this thesis can be summarized as follows:

1. Based on the actor-critic network structure, the natural policy gradient actor-critic (NAC) algorithm is applied in the discounted model to improve the convergence rate and stability of the algorithm. At the same time, to avoid the time-consuming and inaccurate manual tuning of the network's hyperparameters, the Adadelta algorithm is used to adaptively adjust the learning-rate-related hyperparameters of the actor network, which further improves the convergence speed and stability of the resulting Adadelta natural policy gradient actor-critic algorithm (A-NAC); a sketch of this adaptive step size is given after this list. The experimental results show that A-NAC has better learning efficiency and a higher convergence rate than the regular policy gradient method.

2. Many real-world learning tasks have continuous state and action spaces and relatively high input and output dimensions, such as robot control (multiple degrees of freedom), steering-wheel angle control of cars, throttle control, weather forecast recommendation indices, and so on. For this kind of learning task, the deterministic policy gradient (DPG) algorithm is used to solve the continuous reinforcement learning problem. The natural gradient is combined with the deterministic policy gradient algorithm to improve its stability during learning. To enhance exploration, the off-policy learning technique is introduced to learn the deterministic policy from an exploratory behavior policy; a sketch of this off-policy update also follows the list. Experiments on benchmark tasks show that the resulting natural deterministic policy gradient algorithm (N-DPG) performs significantly better than the other algorithms in high-dimensional action spaces.
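As a complement to contribution 1, the following is a minimal sketch of how an Adadelta-style per-parameter step size could replace the fixed actor learning rate in the actor-critic sketch above. The class name `AdadeltaStep`, the constants `rho` and `eps`, and the way it is wired into the actor update are illustrative assumptions rather than the exact A-NAC update, which is built on the natural policy gradient.

```python
import numpy as np

class AdadeltaStep:
    """Adadelta update rule: adapts the step size without a hand-tuned global learning rate."""

    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.acc_grad = np.zeros(shape)    # running average of squared gradients
        self.acc_delta = np.zeros(shape)   # running average of squared updates

    def __call__(self, grad):
        # Accumulate the squared gradient
        self.acc_grad = self.rho * self.acc_grad + (1 - self.rho) * grad ** 2
        # Scale the step by the ratio of past update and gradient magnitudes
        delta = np.sqrt(self.acc_delta + self.eps) / np.sqrt(self.acc_grad + self.eps) * grad
        # Accumulate the squared update for the next step
        self.acc_delta = self.rho * self.acc_delta + (1 - self.rho) * delta ** 2
        return delta

# Hypothetical use inside the actor update: replace the fixed-rate step
#   theta += alpha_actor * delta_td * grad_log
# with an adaptively scaled one (ascent direction, since the actor maximizes return):
#   theta += adadelta_step(delta_td * grad_log)
```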
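For contribution 2, the sketch below illustrates the off-policy deterministic policy gradient idea: a deterministic actor mu_theta(s) is updated along grad_theta mu_theta(s) * grad_a Q(s, a) evaluated at a = mu_theta(s), while an exploratory behavior policy is obtained by adding Gaussian noise to the deterministic action. The linear actor, the action-linear critic, and the noise scale `sigma` are simplifying assumptions, and the natural-gradient preconditioning used by N-DPG is omitted for brevity.

```python
import numpy as np

class OffPolicyDPG:
    """Sketch of an off-policy deterministic policy gradient learner (illustrative only)."""

    def __init__(self, n_features, n_actions, alpha_actor=1e-3, alpha_critic=1e-2,
                 gamma=0.99, sigma=0.3):
        self.theta = np.zeros((n_actions, n_features))  # deterministic actor: a = theta @ phi(s)
        self.W = np.zeros((n_actions, n_features))      # critic term that is linear in the action
        self.v = np.zeros(n_features)                   # state-value part of the critic
        self.alpha_actor, self.alpha_critic = alpha_actor, alpha_critic
        self.gamma, self.sigma = gamma, sigma

    def mu(self, phi):
        return self.theta @ phi                          # deterministic target policy

    def behavior(self, phi):
        # Exploratory behavior policy: Gaussian noise around the deterministic action
        return self.mu(phi) + self.sigma * np.random.randn(len(self.theta))

    def q(self, phi, a):
        return a @ (self.W @ phi) + self.v @ phi         # simple critic, linear in the action

    def update(self, phi, a, reward, phi_next, done):
        # Critic: off-policy TD error, bootstrapping with the *target* policy's next action
        a_next = self.mu(phi_next)
        target = reward + (0.0 if done else self.gamma * self.q(phi_next, a_next))
        td = target - self.q(phi, a)
        self.W += self.alpha_critic * td * np.outer(a, phi)
        self.v += self.alpha_critic * td * phi

        # Actor: deterministic policy gradient, grad_a Q(s, a) at a = mu(s) equals W @ phi
        grad_a_q = self.W @ phi
        self.theta += self.alpha_actor * np.outer(grad_a_q, phi)
```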