Multi-Target Search and Track (MTST) is an important application mode of unmanned aerial vehicle (UAV) swarms, widely used in domains such as environmental monitoring, disaster rescue, border patrol, anti-terrorism, and emergency response. As a core technology of UAV swarms, cooperative decision-making plays an important role in promoting efficient cooperation among the UAVs, thereby enhancing the MTST capability of the swarm. Traditional cooperative decision-making methods usually rely on accurate mathematical models, solve problems inefficiently, and scale poorly, and their cooperative policies cannot evolve and learn continuously. Against the background of the MTST problem of UAV swarms, this dissertation studies cooperative policy learning methods for UAV swarms based on Multi-Agent Deep Reinforcement Learning (MADRL), which should exhibit swarm scalability, behavioral reciprocity, and communication learnability among the UAVs. The main topics and contributions of this dissertation are as follows:

(1) For the cooperative policy learning problem of large-scale UAV swarms, a Markov Decision Process (MDP) model of UAV swarms based on neighbor interaction is established, and a cooperative policy learning method based on feature representation is designed, which improves the scalability of MADRL. This dissertation first models the MTST problem of UAV swarms. Because the swarm may be large and the numbers of neighbors and observed targets of each UAV change over time, the scalability of MADRL is addressed from two aspects: decomposing the joint cooperative decision-making and processing each UAV's variable-length input information. 1) Since the number of UAVs may be large and indeterminate, the swarm-MDP model is established from the perspective of individuals based on local communication and interaction between the UAVs, which makes the decision-making complexity of an individual independent of the scale of the swarm and avoids the
curse of dimensionality. 2) The partial observability and dynamic change of the environment make the length of each UAV's input information uncertain, so the information cannot be fed directly into a neural network with a fixed input dimension. To this end, a diagram-based feature representation method is proposed to represent the variable-length input information as fixed-dimensional features that are invariant to the permutation of the inputs, so that the neural network can adapt to changes in the numbers of neighbors and targets. 3) Based on an experience-sharing training mechanism, the Parameter-Sharing Multi-Agent Dueling Double DQN (PS-MAD3QN) algorithm is designed to learn a shared cooperative policy for the UAVs. The simulation results show that, compared with three baseline information encoding methods, the training time of the proposed method is greatly shortened and the number of targets searched and tracked by the UAVs in the training scenario is increased, which verifies that the proposed method can learn cooperative MTST policies for large-scale UAV swarms; in scenarios with swarm sizes ranging from 5 to 1,000, the UAV swarms search and track more targets and exhibit better scalability; furthermore, under local communication, the UAV swarms achieve MTST performance equivalent to that of global communication using the proposed method.

(2) To resolve cooperation conflicts within UAV swarms, a cooperation evaluation method based on Pointwise Mutual Information (PMI) estimated by a neural network, together with two cooperative policy learning algorithms based on maximizing reciprocal rewards, is proposed to improve the cooperation efficiency of UAV swarms. Owing to partial observability, the UAVs may conflict with one another when maximizing their private rewards, which is detrimental to the cooperation of the swarm. The dissertation therefore proposes a maximizing-reciprocal-reward method that reshapes each UAV's reward
according to the real-time cooperation between the UAVs, thereby rewarding and guiding their cooperation. 1) PMI is introduced to capture the real-time degree of cooperation between the UAVs, and a neural-network estimation method for the PMI is proposed; a simulation example verifies the effectiveness of this method in estimating the real-time cooperation degree between two agents. 2) Each UAV's reciprocal reward is defined as the weighted sum of its environmental reward and an intrinsic reward, where the intrinsic reward is the product of the PMIs and the environmental rewards of the neighboring UAVs, encouraging the UAV's cooperative behavior. 3) Policy gradient optimization processes are derived with two reciprocal reward forms (reciprocal immediate and cumulative rewards) as the objective functions, and the corresponding multi-agent actor-critic algorithms are proposed to learn a shared cooperative policy for the UAVs. The simulation results show that, compared with directly maximizing the environmental reward, the two proposed algorithms enable the UAVs to search and track more targets in the training scenario, and when the swarm size varies from 5 to 1,000, the UAVs still search and track more targets, which verifies the effectiveness of the proposed methods.

(3) For the problem of autonomous cooperative communication policy learning for UAV swarms, a communication policy based on the attention mechanism and a corresponding MADRL algorithm are proposed to improve the autonomous cooperative communication capability of UAV swarms. Manually designed communication protocols are usually insufficiently flexible and general, which limits the communication ability of the UAVs and hinders their cooperation. Therefore, this dissertation adopts a new data-driven approach, exploring how MADRL can be used to learn autonomous cooperative communication policies for UAV swarms instead of the traditional
approach of manual design. 1) The autonomous cooperative communication policy of each UAV is modeled as a mapping from its input information to the published communication message, so that each UAV can autonomously determine the content of the message it publishes according to its real-time status. 2) A neural network based on the attention mechanism is designed to fit the communication policy; the attention mechanism distinguishes the importance of different messages and scales well to dynamic changes in the local communication topology. 3) By maximizing the rewards of the neighboring UAVs, the deterministic communication policy gradient optimization process in a continuous communication space is derived. 4) Based on the centralized-training, decentralized-execution framework, an autonomous cooperative communication-motion hybrid policy learning algorithm is proposed to learn the communication policy and the motion policy simultaneously for UAV swarms. The simulation results show that, compared with the Local CommNet and Attentional Hidden algorithms, the communication and motion policies learned by the proposed algorithm better promote the cooperation of the UAVs, enabling them to search and track more targets; furthermore, the learned policies remain robust as the communication failure probability increases gradually.
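The fixed-dimensional, permutation-invariant feature representation described in contribution (1) can be sketched as mean pooling over per-neighbor and per-target feature vectors. This is one common realization of such an encoding; the function names and the choice of mean pooling here are illustrative assumptions, not the dissertation's exact encoder.

```python
def mean_pool(vectors, dim):
    """Mean-pool a variable-length set of feature vectors into one
    fixed-dimension vector; an empty set maps to the zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def encode_observations(neighbor_feats, target_feats, dim):
    """Fixed-dimension, permutation-invariant encoding of one UAV's
    variable numbers of neighbors and observed targets."""
    return mean_pool(neighbor_feats, dim) + mean_pool(target_feats, dim)

# Reordering the neighbors leaves the encoding unchanged, so a network
# with a fixed input dimension can consume it directly.
a = encode_observations([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5]], dim=2)
b = encode_observations([[3.0, 4.0], [1.0, 2.0]], [[0.5, 0.5]], dim=2)
assert a == b
```

Because the pooled output has the same dimension for 1 neighbor or 1,000, the same policy network can be shared across swarm sizes, which is what gives the approach its scalability.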
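Contribution (2) defines each UAV's reciprocal reward as a weighted sum of its own environmental reward and an intrinsic reward built from the PMIs and environmental rewards of its neighbors. A minimal sketch follows; the closed-form PMI from probabilities stands in for the dissertation's neural-network estimator, and the weight `alpha` is an assumed hyperparameter.

```python
import math

def pmi(p_joint, p_i, p_j):
    """Pointwise mutual information of one action pair:
    log( p(a_i, a_j) / (p(a_i) * p(a_j)) ).
    Zero for independent actions, positive for correlated ones."""
    return math.log(p_joint / (p_i * p_j))

def reciprocal_reward(env_reward, neighbor_rewards, neighbor_pmis, alpha=0.5):
    """Weighted sum of a UAV's own environmental reward and an intrinsic
    reward: the PMI-weighted environmental rewards of its neighbors."""
    intrinsic = sum(p * r for p, r in zip(neighbor_pmis, neighbor_rewards))
    return alpha * env_reward + (1.0 - alpha) * intrinsic

# Independent actions contribute nothing; cooperative (high-PMI) neighbors
# earning high rewards raise this UAV's reshaped reward.
assert abs(pmi(0.25, 0.5, 0.5)) < 1e-12
r = reciprocal_reward(1.0, [2.0, 0.5], [0.7, -0.2], alpha=0.5)
```

Maximizing this reshaped reward rather than the private environmental reward is what lets the actor-critic algorithms in contribution (2) reward and guide real-time cooperation.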
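The attention mechanism in contribution (3) can be illustrated with scaled dot-product attention over a variable number of neighbor messages: the softmax weights rank message importance, and the output dimension stays fixed as the local communication topology changes. This is a generic sketch of the mechanism, not the dissertation's specific network.

```python
import math

def attend(query, messages):
    """Scaled dot-product attention over a variable number of neighbor
    messages: a softmax-weighted sum whose dimension is fixed regardless
    of how many neighbors are currently in communication range."""
    if not messages:
        return [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, msg)) / scale for msg in messages]
    mx = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - mx) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * msg[k] for w, msg in zip(weights, messages))
            for k in range(len(query))]

# The same downstream policy input shape works for one neighbor or many,
# so the learned policy tolerates a dynamically changing topology.
out_one = attend([1.0, 0.0], [[1.0, 0.0]])
out_two = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
assert len(out_one) == len(out_two) == 2
```

The empty-neighborhood case returning a zero vector is one simple way to keep execution decentralized and robust when communication fails, consistent with the robustness experiments summarized above.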