With the continuous development of science and technology, the applications of unmanned aerial vehicles (UAVs) have gradually expanded, which implicitly places higher requirements on UAV intelligence. In the future, a UAV should be able to complete obstacle avoidance, path planning, and other common tasks by interacting with the environment independently, rather than relying entirely on hand-coded programs. Reinforcement learning (RL) is a feasible technical route toward full UAV autonomy and has already been employed in games and other tasks that are difficult to learn. However, the heavy online computational load and the large quantity of interactive training data that RL requires hinder its broader application, especially for UAVs. In view of this, this dissertation studies RL and its application to UAV path planning, and tries to equip the agent with expert knowledge before learning starts in order to improve the practicality of RL. The research is carried out from two aspects. First, taking expert knowledge about the task into account, the dissertation tries to reduce the computational complexity of RL by introducing batch recursive least squares or special basis functions. Second, from the perspective of transfer, the dissertation focuses on methods for reusing the knowledge contained in demonstrations of source tasks, and further explores the application of RL aided by demonstration knowledge transfer to UAV path planning. The main research work and contributions are summarized as follows:

1. The dissertation systematically reviews the current research status of RL and its applications, focusing on RL aided by expert knowledge to compensate for the blindness of tabula rasa learning, especially work that is combined with transfer. In addition, the dissertation summarizes the difficulties of applying transfer in RL and proposes a framework for transferring knowledge from demonstrations of source tasks to new tasks in order to accelerate RL.

2. As in the Actor-Critic
structure of RL, the computational complexity of estimating the natural gradient is several times that of estimating the ordinary gradient, this dissertation proposes a batch recursive approach to effectively reduce the online computational load. During RL, once the quantity of accumulated data reaches a preset value, the agent estimates the natural gradient with recursive least squares, so the natural gradient is computed less often. At the same time, because the estimate is more accurate, the agent can increase the update step size of the policy parameters so that the convergence speed is not greatly affected. In general, the batch recursive approach lets the agent process interaction data flexibly with an acceptable online computational load.

3. The dissertation proposes special basis functions to approximate the symmetric state value function and policy when symmetry exists in the state-action space. When constructing the basis functions, the expert considers each center point together with its symmetric position, so that the basis functions take equal values at symmetric states and the state values are therefore equal as well. In addition, because far fewer special basis functions are needed than regular basis functions, RL is noticeably accelerated and the computational load is greatly reduced.

4. Because the number of demonstration trajectories is limited while each single trajectory is a long series, such trajectories are difficult for machine learning methods to classify; the dissertation therefore proposes a new classifier that combines dynamic movement primitives (DMP) with a convolutional neural network (CNN). First, the algorithm treats trajectories as independent multi-dimensional time series and takes the parameters of the corresponding DMP as the representation of each single dimension. Then a CNN with a transformed kernel function is used to explore the structural change information in the DMP parameter sequence and to classify it. Finally, the labels of the different dimensions are integrated
to determine the final type of the entire time series.

5. Because an expert can easily demonstrate simple tasks while RL is suited to solving difficult ones, this dissertation proposes multiple methods for reusing the expert knowledge contained in demonstration trajectories of related source tasks when learning more difficult tasks. The dissertation mines the expert knowledge contained in demonstrations from multiple perspectives and then transfers it to the new task through the agent space or an inter-task mapping, in order to guide the exploration process, encourage the agent to explore certain states, or even serve directly as the initial policy, thereby accelerating learning.

6. In order to apply RL to UAV path planning, the dissertation proposes guiding the agent’s exploration with a reconstructed trajectory. First, the expert demonstrates a number of tasks whose parameters are known, equipping the agent with different obstacle avoidance skills; when the agent encounters a similar situation, it can generalize a new collision-free trajectory through the constructive relationship between tasks. Finally, the order of points along the trajectory is used to construct the potential function, and Q-learning is trained to obtain a good policy. The algorithm reduces the number of learning failures and verifies the feasibility of applying RL to UAV path planning.
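
As an illustration of the last contribution, the following is a minimal sketch of how the order of points along a guiding trajectory could define a potential function for Q-learning. The abstract does not give the exact formulation, so everything concrete here is an assumption: the shaping term takes the standard potential-based form gamma * phi(s') - phi(s), the potential is the normalized index of the nearest waypoint, and the grid size, obstacle cells, rewards, and the `TRAJECTORY` waypoints are hypothetical values chosen only for the example.

```python
import random

GAMMA, ALPHA, EPSILON = 0.95, 0.5, 0.1
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
SIZE = 5
OBSTACLES = {(2, 1), (2, 2), (2, 3)}  # hypothetical obstacle cells
START, GOAL = (0, 2), (4, 2)
# Hypothetical stand-in for the reconstructed collision-free trajectory.
TRAJECTORY = [(0, 2), (0, 3), (0, 4), (1, 4), (2, 4),
              (3, 4), (4, 4), (4, 3), (4, 2)]


def make_potential(trajectory):
    """Assumed potential: normalized index of the nearest waypoint."""
    n = len(trajectory)

    def potential(state):
        def key(i):
            wx, wy = trajectory[i]
            dist = abs(state[0] - wx) + abs(state[1] - wy)
            return (dist, -i)  # on ties, prefer the later waypoint
        return min(range(n), key=key) / (n - 1)

    return potential


def step(state, action):
    """Toy grid dynamics: returns (next_state, reward, done)."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE):
        return state, -1.0, False   # bumped into the boundary, stay put
    if nxt in OBSTACLES:
        return state, -10.0, True   # collision ends the episode
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -1.0, False


def train(episodes=300, seed=0):
    """Tabular Q-learning with the shaping term added to each reward."""
    rng = random.Random(seed)
    phi = make_potential(TRAJECTORY)
    q = {}
    n_actions = len(ACTIONS)
    for _episode in range(episodes):
        s = START
        for _t in range(100):
            if rng.random() < EPSILON:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q.get((s, i), 0.0))
            s_next, r, done = step(s, ACTIONS[a])
            # Shaping: reward progress along the guiding trajectory.
            r += GAMMA * phi(s_next) - phi(s)
            best_next = 0.0 if done else max(
                q.get((s_next, i), 0.0) for i in range(n_actions))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)
            s = s_next
            if done:
                break
    return q
```

Calling `train()` returns the learned Q-table as a dictionary keyed by `(state, action_index)`. In this sketch the shaping term only biases exploration toward states further along the guiding trajectory; the environment reward itself is left unchanged.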