Fierce antagonism, super-real-time response, and incomplete information are prominent characteristics of modern warfare, which make it difficult to design operational process branches, planning decision branches, and emergency disposal schemes in advance. Coping with situational uncertainty while improving the efficiency and accuracy of planning has therefore become a key topic in the study of how modern wars are won. As an effective means of studying war, wargaming has been widely used in operational training and in the evaluation of operational approaches. Research on contingency planning in wargames, which can be verified by operational experiments, has become an important way to develop new contingency planning techniques. Traditional contingency planning relies on commanders’ experience, predesigned decision rules, and prior-knowledge models for operational optimization, and then uses an expert system to re-plan according to the real-time situation. However, this heavy dependence on prior knowledge makes the uncertainty caused by the continuous evolution of modern warfare a hard problem for traditional methods. The development of intelligent planning techniques such as Reinforcement Learning (RL) in recent years, and especially the success of AlphaStar and AlphaGo, offers a new approach to contingency planning. Some problems remain, however. Operational applications, with wargames as an example, require better situational awareness and more structured control, and are much more complicated than environments such as StarCraft and Go. This makes RL difficult to converge during training and hard to explain during exploitation.

To address these problems, this paper focuses on “prior knowledge + deep RL” contingency planning for wargames, merges traditional expert systems with modern trial-and-error learning methods, and puts forward a research framework for RL methods with embedded knowledge models. The paper presents two ways of embedding prior knowledge into RL: one is time-segmented invocation, and
the other is parameter optimization. The two methods not only inherit the interpretability and generalization performance of the prior-knowledge model but also retain the representational ability and dynamic adaptability of RL. Experiments on open RL test platforms and a wargame platform show that the two methods speed up training, improve strategy rewards, and make RL easier to explain. The methods were applied in the National Wargaming Competition and won championships in 2019 and 2020.

The main work and innovations of this paper are as follows.

(1) We construct a framework that merges embedded prior-knowledge models with RL methods. Starting from a three-element-level task contingency planning structure with two types of sub-planning, the paper designs a framework of task-level intelligent contingency planning methods that takes the contingency planning of subgoals and task orders as its research object. It puts forward the Diversified Situational Information Prediction Code Methods Framework, which defines the functional requirements and basic modules for processing diversified situational information in a complicated antagonistic environment. For the task order planning stage, the paper also proposes two RL method frameworks that differ in how the prior-knowledge model is invoked. One is the Time-Segmented Invocation RL Methods Framework, which is built on the Isoperiodic Option Semi-Markov Decision Process (IOSMDP) and describes the constraints on intelligently invoking fully autonomous prior-knowledge models; the other is the Parameters Optimization RL Methods Framework, which is built on the Compositional Action Markov Decision Process (CAMDP) and describes the constraints on intelligently invoking prior-knowledge models with hyperparameters. These three method frameworks cover all stages of contingency planning and present the fundamental modes of embedding prior-knowledge models into RL; they are the infrastructure for the key techniques in the paper.

(2) We propose Situational Information
Contrastive Prediction Representation methods. To address the weakness of traditional sequential coding in representing situational evolution, the paper proposes Situational Information Contrastive Prediction Representation methods within the Diversified Situational Information Prediction Code Methods Framework. It designs an autoregressive encoder to process sequential situational information and uses a mutual-information loss to predict how the situation will develop. Experiments on open RL test platforms and a wargame platform verify that the method is effective in capturing the key features of situational evolution and robust to noise and nonessential features.

(3) We present Uniform Time-Segmented Invocation RL. Traditional control of temporally extended actions (macro actions) is based on the semi-MDP model, in which the time scales of situational information and macro actions are usually inconsistent, which can reduce the efficiency of RL optimization. Within the Time-Segmented Invocation RL Methods Framework based on IOSMDP, the paper puts forward Uniform Time-Segmented Invocation RL, which achieves a uniform time scale for situational information processing and macro action execution. Experiments on open RL test platforms and wargame platforms show that Uniform Time-Segmented Invocation RL significantly accelerates training convergence and obtains better strategy rewards.

(4) We propose Sequential Actor-Critic Parameters Optimization RL. Task-level contingency planning for a squad usually requires combining multiple prior-knowledge models that have hyperparameters. We therefore design Sequential Actor-Critic Parameters Optimization RL within the Parameters Optimization RL Methods Framework based on CAMDP. The method decomposes multi-dimensional combined actions into one-dimensional atomic actions, each of which is handled by both the actor network and the critic network. The design of the actor network allows the method to generate combinations of continuous and discrete parameters. In experiments on open RL test
platforms and wargame platforms, we find that our methods outperform similar traditional algorithms in both training efficiency and strategy rewards.

(5) We implement task-level intelligent contingency planning for an air fleet under a classical antagonistic scenario. We present all three stages of intelligent contingency planning: situation information preprocessing, task contingency planning, and task order generation. In the situation preprocessing stage, we design a “Global Situation Information Statistics Vector + Regional Situation Information Statistics Grid” representation. In the task contingency planning stage, we first give the prior-knowledge model for air fleet contingency planning, and then implement situation awareness, subgoal design, and intelligent operational command and control using the Situational Information Contrastive Prediction Representation methods, Uniform Time-Segmented Invocation RL, and Sequential Actor-Critic Parameters Optimization RL. Experiments show that the agent learns effective strategies for subgoal transition and operational command and control. The models were applied in the National Wargaming Competition and won the championship (outstanding winner) in 2019 and 2020.
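
The idea of invoking a prior-knowledge model for a fixed time segment, as described in contribution (3), can be illustrated with a minimal Python sketch. This is not the paper's IOSMDP formulation or implementation. It only assumes that a high-level policy picks one prior-knowledge model every `period` primitive steps and that the reward collected inside the segment is discounted and summed. All names here (`run_option`, `rollout`, `toy_step`, `advance`, `retreat`) are hypothetical, and the one-dimensional toy environment exists purely for illustration.

```python
from typing import Callable, List, Tuple

State = int
Action = int
# Hypothetical interface: a prior-knowledge model maps a state to a primitive action.
KnowledgeModel = Callable[[State], Action]
# Hypothetical environment step: (state, action) -> (next_state, reward, done).
StepFn = Callable[[State, Action], Tuple[State, float, bool]]


def run_option(step: StepFn, state: State, model: KnowledgeModel,
               period: int, gamma: float) -> Tuple[State, float, int, bool]:
    """Let one prior-knowledge model act for at most `period` primitive steps.

    Returns the resulting state, the discounted reward accumulated inside the
    segment, the number of primitive steps taken, and the done flag.
    """
    total, discount, steps, done = 0.0, 1.0, 0, False
    for _ in range(period):
        state, reward, done = step(state, model(state))
        total += discount * reward
        discount *= gamma
        steps += 1
        if done:
            break
    return state, total, steps, done


def rollout(step: StepFn, state: State, models: List[KnowledgeModel],
            choose: Callable[[State], int], period: int, gamma: float,
            max_decisions: int) -> List[Tuple[State, int, float, State, int]]:
    """High-level loop: one decision (which model to invoke) per time segment."""
    transitions = []
    for _ in range(max_decisions):
        k = choose(state)  # index of the model to invoke for this segment
        next_state, reward, steps, done = run_option(
            step, state, models[k], period, gamma)
        transitions.append((state, k, reward, next_state, steps))
        state = next_state
        if done:
            break
    return transitions


def toy_step(state: State, action: Action) -> Tuple[State, float, bool]:
    """Toy 1-D environment: reaching position 6 or beyond ends the episode."""
    nxt = state + action
    done = nxt >= 6
    return nxt, (1.0 if done else 0.0), done


def advance(state: State) -> Action:
    """Toy prior-knowledge model that always moves forward."""
    return 1


def retreat(state: State) -> Action:
    """Toy prior-knowledge model that always moves backward."""
    return -1


if __name__ == "__main__":
    result = rollout(toy_step, 0, [advance, retreat],
                     choose=lambda s: 0, period=4, gamma=0.5, max_decisions=5)
    for transition in result:
        print(transition)
```

Each recorded transition pairs the state observed at a decision point with the segment-level reward, so the observation and the macro action share one time scale. A learning algorithm would replace the fixed `choose` function with a trained policy.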