Recently, deep reinforcement learning, which combines reinforcement learning and deep learning, has been successfully applied to many sequential decision-making problems, such as Go, video games, robotic control, and dialogue systems. Reinforcement learning models a sequential decision-making problem as a Markov decision process and provides algorithms for two core problems: policy evaluation and policy control. Deep learning uses the expressive power of deep neural networks to automatically extract features from real-world data. By combining the two, deep reinforcement learning can learn and improve itself when dealing with complex real-world problems. In this thesis, we study deep reinforcement learning algorithms from three perspectives: policy evaluation, policy control, and deep learning. Policy evaluation is one of the foundations of reinforcement learning; its goal is to estimate the value of a given policy. Once a policy can be evaluated, the policy control problem is to find the optimal policy. Efficient learning algorithms for deep neural networks are also essential to the success of deep reinforcement learning. The purpose of this thesis is to analyze, understand, and improve deep reinforcement learning algorithms through the application of mathematical tools, which is of great significance for the further development of the field.

The first part of this thesis focuses on policy evaluation. Our work is the first to resolve the open problem of the convergence rate of the commonly used GTD algorithms when the data are Markovian rather than independent and identically distributed (i.i.d.). To this end, we first use the mixing time to characterize the non-i.i.d. nature of the data distribution. Second, we design a novel error decomposition for the general saddle point problem and prove a convergence rate for stochastic gradient algorithms on this problem with non-i.i.d. data. Finally, we extend the results to the GTD algorithms, providing a rigorous convergence rate analysis and discussing the factors that influence convergence from different aspects.

The second part of this thesis focuses on transfer learning in policy control. An important problem in policy control is how to perform transfer learning, which accelerates learning in new tasks by reusing knowledge learned from old tasks. Most previous works are empirical; without theoretical analysis, the effectiveness of an algorithm cannot be guaranteed, which may lead to negative transfer. Another limitation is that most of them do not exploit the properties of reinforcement learning algorithms. We theoretically investigate the convergence of Q-learning and find that the target Q function strongly influences the convergence rate. We therefore propose target transfer Q-learning, which transfers the target Q function. To avoid negative effects from improper transfer, we design an error condition based on the theory, and we prove that, by verifying this error condition, target transfer Q-learning guarantees the effectiveness of the transfer. Finally, we test the practicality of the algorithm in various experiments.
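To make the idea concrete, here is a minimal sketch of tabular Q-learning with a transferred target. The environment interface (`reset`, `step`, `sample_action`), the `error_condition` proxy, and its threshold are all illustrative assumptions, not the exact formulation or error condition derived in the thesis.

```python
import numpy as np

def target_transfer_q_learning(env, q_source, episodes=500,
                               alpha=0.1, gamma=0.99, eps=0.1):
    """Illustrative tabular Q-learning with a transferred target.

    `q_source` is a Q table learned on a related source task. At each
    step, a (hypothetical) error condition decides whether to bootstrap
    from the transferred target or from the current estimate.
    """
    n_states, n_actions = q_source.shape
    q = np.zeros((n_states, n_actions))

    def error_condition(s_next):
        # Hypothetical proxy for the thesis's error condition: trust the
        # transferred target only while its value estimate stays close
        # to the current one (a small self-consistency gap).
        gap = abs(q_source[s_next].max() - q[s_next].max())
        return gap <= 1.0  # illustrative threshold

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            a = env.sample_action() if np.random.rand() < eps else q[s].argmax()
            s_next, r, done = env.step(a)
            # Choose the bootstrap target: transferred vs. self.
            target_q = q_source if error_condition(s_next) else q
            target = r + gamma * (0.0 if done else target_q[s_next].max())
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q
```

The point of the error check is that a transferred target is only used while it provably helps; once the condition fails, the update falls back to standard Q-learning, which is what rules out negative transfer.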
The third part of this thesis focuses on the learning of deep neural networks. Their high degree of non-linearity makes deep neural networks difficult to train. Previous works have found that the original parameter space of feedforward neural networks is redundant, which brings additional difficulties to their optimization. In deep reinforcement learning, non-feedforward structures such as recurrent neural networks and attention networks are also widely used. Since previous work is restricted to feedforward structures, and in order to put it into practice, we study other commonly used non-feedforward structures, such as recurrent neural networks and attention networks. Our work proceeds in three steps. First, we analyze the path representation of these neural networks. Second, we prove that a subset of paths can represent all paths and that the paths in this subset are mutually independent; we also design an efficient algorithm to find this subset. Third, to apply the theoretical results, we design optimization algorithms in the new parameter space, taking into account both the effectiveness and the efficiency of the optimization, and we verify the performance of the algorithms on a variety of datasets.
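As background for the path representation, the sketch below enumerates the input-to-output paths of a tiny fully connected ReLU network and computes each path value as the product of the weights along it. This illustrates only the basic feedforward case, not the extension to recurrent or attention structures studied in the thesis, and all names are illustrative.

```python
import itertools
import numpy as np

def path_values(weights):
    """Enumerate input-to-output paths of a fully connected ReLU net.

    `weights` is a list of weight matrices W_l of shape (n_out, n_in).
    A path picks one unit per layer; its value is the product of the
    weights on its edges. The network output can be written as a sum of
    path values, each gated by whether the ReLUs on the path are active,
    which is why paths form an alternative parameterization.
    """
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    paths = {}
    for units in itertools.product(*[range(n) for n in sizes]):
        value = 1.0
        for layer, w in enumerate(weights):
            value *= w[units[layer + 1], units[layer]]
        paths[units] = value
    return paths

# Example: a 2 -> 2 -> 1 network has 2 * 2 * 1 = 4 paths.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
for path, value in path_values([w1, w2]).items():
    print(path, round(value, 3))
```

Because path values are products of shared weights, they are not all independent; identifying an independent subset that still determines every path value is exactly what motivates optimizing in the new parameter space.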