An agent is any entity that can perceive its environment through sensors and act on that environment through actuators. Aircraft, unmanned vehicles, and robots in the real world can all be regarded as agents. This paper takes high-speed aircraft as the agents. High-speed aircraft have great advantages in penetration operations because of their high flight speed and altitude. At present, traditional planning algorithms for penetration operations can no longer meet the needs of future intelligent operations; methods with higher intelligence, greater generality, and better real-time performance are the future development trend. Reinforcement learning (RL) is undoubtedly the method that best meets these needs at this stage. This paper therefore explores in depth the application of reinforcement learning to the multi-agent task assignment problem. The task assignment problem combines two subproblems: target assignment and path planning. This paper starts from the path planning problem for a single agent and gradually extends it to target assignment and path planning for multiple agents, yielding a feasible method for extending the single-agent case to the multi-agent case. The main work of this paper includes the following aspects.

First, in view of the characteristics of the combat environment of high-speed aircraft, this paper establishes a Markov decision process (MDP) model to describe the task of a single aircraft autonomously avoiding threats and reaching the vicinity of a designated target. The state space of the model contains the location information of all objects that threaten the safety of the aircraft, and the agent is modeled with an attention mechanism so that it learns the mapping from states to actions. The agent is trained with a family of model-free reinforcement learning algorithms. In addition, an action filter is designed to ensure the temporal correlation of the agent’s successive decisions. In the experiments, the MDP model
established in this paper shows better overall performance than the common partially observable Markov decision process (POMDP) model, and it also generalizes well to environments with densely distributed threat objects and with dynamic threat objects.

Second, considering the reliability of the RL method, this paper combines RL with the cross-entropy method (CEM) and proposes the RL-CEM algorithm to optimize the trained agent model online. RL-CEM learns an auxiliary decision-making model through CEM. This model takes as input the action that the agent selects for the observed state and outputs an optimized action. The algorithm corrects erroneous agent behavior while preserving real-time performance. In the comparative experiments, RL-CEM achieves a higher penetration success rate than the RL method alone, and it still maintains a satisfactory penetration success rate when extended to combat airspaces of different sizes, different numbers and sizes of threat objects, dynamic threat objects, and environments in which the agent easily falls into local optima.

Finally, this paper decomposes the multi-aircraft task assignment problem into two subproblems: target assignment and task execution. Task execution means that each aircraft must successfully penetrate the defenses and reach its own target. Building on the MDP model and the agent model established for the single-agent case, this paper combines centralized training with decentralized execution (CTDE), parameter sharing (PS), and value decomposition (VD) to establish a multi-agent model. By extending the phasic policy gradient (PPG) algorithm, PPG-PSVD is proposed to solve the multi-agent reinforcement learning (MARL) problem. Targets are assigned by the heuristic multi-target dynamic allocation algorithm proposed in this paper. This method constructs a value matrix based on the state value function learned by the agent and the Euclidean distance
between the aircraft and the target, and then applies the Hungarian algorithm to this value matrix to obtain the optimal target assignment. This paper compares PPG-PSVD with current state-of-the-art MARL algorithms; PPG-PSVD shows excellent sample efficiency on the MARL problem studied here. The heuristic multi-target dynamic allocation algorithm also proves feasible for avoiding local optima.
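
As an illustration of the target assignment step summarized above, the following minimal Python sketch builds a value matrix from a state value and the Euclidean distance between each aircraft and each target, then picks the one-to-one assignment with the highest total value. Several parts are assumptions and are not taken from the paper: the function names, the `distance_weight` parameter, the rule "state value minus weighted distance" for combining the two terms, and the toy numbers. For brevity the sketch uses an exhaustive search over permutations in place of the Hungarian algorithm; both return an optimal assignment for a square matrix, but the exhaustive search is only practical for a handful of aircraft.

```python
import itertools
import math


def build_value_matrix(aircraft, targets, state_value, distance_weight=0.1):
    """Score every (aircraft, target) pair.

    Assumption: the score is the learned state value minus a weighted
    Euclidean distance. The paper's actual combination rule may differ.
    """
    matrix = []
    for i, aircraft_pos in enumerate(aircraft):
        row = []
        for j, target_pos in enumerate(targets):
            distance = math.dist(aircraft_pos, target_pos)
            row.append(state_value(i, j) - distance_weight * distance)
        matrix.append(row)
    return matrix


def best_assignment(value_matrix):
    """Return (assignment, total) maximizing the summed value.

    assignment[i] is the target index given to aircraft i.
    Exhaustive search is a stand-in for the Hungarian algorithm.
    """
    n = len(value_matrix)
    best_perm, best_total = None, -math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(value_matrix[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


if __name__ == "__main__":
    # Hypothetical positions and state values, for illustration only.
    aircraft = [(0.0, 0.0), (10.0, 0.0)]
    targets = [(0.0, 5.0), (10.0, 5.0)]
    toy_values = [[1.0, 0.2], [0.3, 0.9]]
    matrix = build_value_matrix(aircraft, targets, lambda i, j: toy_values[i][j])
    print(best_assignment(matrix))
```

Because the value matrix depends on the current positions and state values, recomputing it and re-solving the assignment at each decision step would give a dynamic allocation; whether the paper re-solves at every step is not stated in the abstract.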