In recent years, with the rapid development of science and technology, emerging industries such as cloud computing, big data, data centers, virtual reality, 5G, artificial intelligence (AI), the Internet of Things, and fiber-optic sensing have been flourishing. These developments are changing the way we live and simplifying tasks that were difficult to accomplish in the past. At present, the unmanned surface vehicle (USV) is widely used in military, industrial, and many other fields, where it can replace humans in dangerous or special environments to carry out difficult or hazardous operations. Keeping the USV safe when emergencies arise during a mission is a basic requirement, but this has become harder as the maritime environment grows more complex. Path planning and obstacle avoidance for USVs have therefore become a research hotspot in recent years, and how a USV can coexist harmoniously with people and other agents in such a highly dynamic environment is one of the key issues.

Aiming at the obstacle avoidance and path planning problems of the USV, and based on a custom design of the state space and action space, this paper proposes three algorithms: the Double DQN algorithm, a DQN algorithm based on rank-based (sorted) prioritized sampling, and a DQN algorithm based on tree-based prioritized sampling. The three algorithms are applied to the path planning and obstacle avoidance of the agent and compared with the standard DQN algorithm. All of them are built on the reinforcement learning framework, in which the neural network is trained through "trial and error" interaction with the environment. Experiments show that the algorithms proposed in this paper can achieve the path planning task goal. The main contents are as follows:

1. In order to make the neural network's estimates of output actions more stable, this paper introduces the Double DQN algorithm for USV path planning. On the basis of DQN, it mitigates overestimation by decoupling action selection from target Q-value calculation in two steps: instead of directly taking the maximum Q value over actions in the target Q network, the action with the maximum Q value is first selected in the current Q network, and the target Q value is then computed in the target network for that selected action. The algorithm realizes end-to-end policy selection from input to output: the USV position information is fed in, the network perceives this state, and the action with the largest Q value in the action space is executed, so that the policy with the largest cumulative reward is the optimal policy, thereby realizing autonomous obstacle avoidance and path planning for the USV. Under identical hyperparameters, cumulative reward is used as the evaluation index. Simulation results show that the Double DQN algorithm begins to converge at around 15,000, 1,200, 20,000, and 18,000 steps on the respective maps, outperforming the DQN algorithm.
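For reference, the two-step decoupling described in point 1 corresponds to the standard Double DQN target. The following is a sketch in our own notation (with \theta the parameters of the current network and \theta^{-} those of the target network), not an equation quoted from the thesis:

y_t = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta);\ \theta^{-}\big)

whereas plain DQN uses y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}). Selecting the action with the current network and evaluating it with the target network is what reduces the systematic overestimation of Q values.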
2. The DQN model draws experience from the replay pool by uniform sampling in order to break the correlation between consecutive samples used to train the neural network, while the pool itself is updated in a "first in, first out" manner. This causes two problems: first, a large number of similar experiences occupy the pool, making convergence difficult; second, training the neural network takes a long time, so the time cost increases. This paper therefore proposes a rank-based (sorted) prioritized experience replay DQN algorithm, which uses the TD-error to update the experience pool and focuses learning on transitions with large error values so as to increase learning efficiency. Meanwhile, to prevent overfitting, an importance-sampling mechanism is introduced to adjust the model update by reducing the weight of frequently sampled transitions (the standard formulation is sketched after point 3). Different numbers of training rounds are set on the same map, and the cumulative number of steps over those rounds is used as the evaluation index. The simulation results show that, compared with DQN, the cumulative steps of the proposed algorithm after 30, 50, and 70 rounds are 16,884, 21,034, and 29,723, respectively, all fewer than those of the DQN algorithm.

3. To address the problems that uniform sampling does not make full use of the stored information and that action selection during training is too random, leading to slow convergence, this paper proposes a third algorithm: a Dueling DQN algorithm based on tree-based prioritized sampling. The action value function Q is decomposed into a state value function (V) and an advantage function (A); a major benefit of this decomposition is that it generalizes learning across actions without changing the underlying reinforcement learning algorithm. At the same time, the absolute value of the TD-error is used directly as the priority index for prioritized sampling, as sketched below. The network model is then built and experiments are carried out on four maps. The simulation results show that the proposed algorithm converges on the four paths at about 12,000, 13,000, 14,000, and 15,000 steps, and its loss values are also superior to those of the DQN algorithm.
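As background for point 2, rank-based (sorted) prioritized experience replay is usually formulated as follows; this is a sketch of the standard form, with \alpha, \beta, and the pool size N as hyperparameters in our own notation rather than values reported in the thesis:

p_i = \frac{1}{\mathrm{rank}(i)}, \qquad P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \Big(\frac{1}{N\, P(i)}\Big)^{\beta}

Here \mathrm{rank}(i) orders the stored transitions by the magnitude of their TD-error \delta_i = r_i + \gamma \max_{a} Q(s_{i+1}, a; \theta^{-}) - Q(s_i, a_i; \theta), P(i) is the probability of sampling transition i, and the importance-sampling weight w_i (normalized by its maximum) scales down the updates for frequently sampled transitions, which is the overfitting correction described in point 2.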
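Similarly, the decomposition in point 3 matches the standard dueling aggregation, and tree-based proportional prioritization typically stores priorities as shown here; again this is a sketch under the usual conventions, not a formulation taken from the thesis:

Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big), \qquad p_i = |\delta_i| + \epsilon

Subtracting the mean advantage keeps V and A identifiable, and the small constant \epsilon ensures that every stored transition retains a non-zero sampling probability in the priority tree.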