
Research On Sample Adaptive Action Planning Based On Predictive Coding

Posted on: 2023-07-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X X Liang    Full Text: PDF
GTID: 1522307169976949    Subject: Management Science and Engineering
Abstract/Summary:
Traditional combat operations planning systems are driven mainly by a priori knowledge and operations research models. They offer good explainability and fast inference, but the relevant knowledge must be modelled manually, the systems are difficult to extend once finalised, and they cannot respond quickly to dynamic environments or cope with complex settings involving incomplete information and uncertain boundaries. Deep reinforcement learning, represented by game-playing results such as AlphaGo, AlphaStar and AlphaDogfight, uses large-scale computing power and massive training data to support "end-to-end" learning from state to action, realising autonomous action planning under incomplete information. It therefore offers a new way to address the weaknesses of traditional combat operations planning, namely its limited ability to handle incomplete information and its difficulty with dynamic, ad hoc adjustment. However, directly applying existing deep reinforcement learning algorithms to combat operations planning runs into slow policy convergence, low data utilisation and a lack of experimental platforms. To address these problems, on the algorithmic side this dissertation studies state representation, discrepancies in sample value estimation and the reuse of historical samples, developing action planning based on predictive coding and temporal-difference-sensitive sample-adaptive action planning to improve sample utilisation and accelerate policy convergence; on the platform side, it designs an intelligent learning framework for wargame-oriented action planning that supports reinforcement learning and related algorithms. The main work and innovations of this dissertation are as follows.

(1) An action planning method based on predictive coding. Existing state representations are static encodings of the environment state that ignore the dynamics of the planning task itself, which makes policy learning harder. We propose a predictive-encoding learning and planning method built on an autoregressive model: an autoencoder compresses the environment state into a low-dimensional space, the dynamics of the environment are learned in that space, and a recurrent neural network combined with a mixture density network predicts future information such as successor states and rewards in the low-dimensional space. The hidden state produced during learning is used as the predictive encoding of the current state, and the action planning policy is built on this encoding. Experimentally, in a feature-clustering task on the CartPole environment the predictive codes aggregate into eight cluster centres with clear inter-class separation, outperforming VAE codes; in successor-information prediction, the reconstruction loss of the predictive coding converges to 0.01 versus 0.1 for VAE coding, and the end-state prediction loss on the test set is 0.06 versus 0.15; in the classic control tasks of OpenAI Gym, the policy based on predictive coding converges 54.3% faster than the VAE-coding baseline and its final converged value is 50% higher.
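The abstract gives no implementation details; purely as an illustration of the structure described above (an autoencoder that compresses the state, a recurrent network with a mixture density head that predicts successor information, and the recurrent hidden state used as the predictive encoding), the following PyTorch-style sketch shows one plausible arrangement. All class, layer and variable names are assumptions introduced here, not the dissertation's code.

```python
# Minimal sketch (not the dissertation's code) of a predictive-coding world model:
# an autoencoder compresses the observation, a GRU with a mixture-density head
# predicts the next latent state and reward, and the GRU hidden state serves as
# the predictive encoding on which the policy acts.
import torch
import torch.nn as nn

class PredictiveEncoder(nn.Module):
    def __init__(self, obs_dim, latent_dim=32, hidden_dim=128, n_mixtures=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, obs_dim))
        self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        # Mixture-density head: per mixture, one weight logit plus mean and
        # log-std over the next latent state.
        self.mdn = nn.Linear(hidden_dim, n_mixtures * (1 + 2 * latent_dim))
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq):
        z = self.encoder(obs_seq)              # (B, T, latent_dim)
        recon = self.decoder(z)                # reconstruction for the autoencoder loss
        h_seq, _ = self.rnn(z)                 # hidden states = predictive encodings
        mdn_params = self.mdn(h_seq)           # parameters of the next-latent mixture
        reward_pred = self.reward_head(h_seq)  # predicted one-step reward
        return recon, h_seq, mdn_params, reward_pred

# The planning policy would then take the predictive encoding, not the raw state:
# action_logits = policy_net(h_seq[:, -1])
```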
(2) A sample-adaptive action planning method based on temporal-difference error. By evaluating the value-function loss of individual experiences, we construct an adaptive factor based on the temporal-difference error that dynamically re-weights the value-function loss of different experiences so that important experiences receive higher update weights, and we embed this factor into mainstream reinforcement learning algorithms such as deep Q-networks (DQN) and Actor-Critic (AC). In experiments, the DQN with the embedded adaptive factor converges faster than vanilla DQN and DQN with prioritised experience replay on the classic control task CartPole-v1, with improvements of 26% and 15% respectively; on Atari 2600 games such as Amidar, Assault, Berzerk, Breakout, Qbert and RoadRunner, its final converged value is 15% higher than vanilla DQN for the same amount of training data. The AC method with the embedded adaptive factor converges 50% faster than vanilla AC on CartPole-v1 and 25% faster on the same Atari 2600 games with the same amount of training data; its final converged value is 25% higher than vanilla AC, with smaller fluctuations. These results indicate that a good critic or Q function can guide policy improvement, and that accelerating the learning of the critic or Q function improves the efficiency of policy improvement.
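The exact form of the adaptive factor is not specified in the abstract; the sketch below is one plausible instantiation of the idea, weighting each transition's value loss by a normalised function of its temporal-difference error within a DQN-style update. The function name, the exponent alpha and the normalisation scheme are assumptions.

```python
# Illustrative sketch (assumed form, not taken from the dissertation) of re-weighting
# the per-sample value loss by an adaptive factor derived from the TD error, so that
# transitions with larger TD error receive larger update weights.
import torch

def td_adaptive_loss(q_net, target_net, batch, gamma=0.99, alpha=0.6):
    obs, actions, rewards, next_obs, dones = batch
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next
    td_error = target - q
    # Adaptive factor: larger |TD error| -> larger weight, normalised over the batch
    # and detached so the weight itself is not differentiated through.
    weight = td_error.abs().pow(alpha)
    weight = (weight / (weight.mean() + 1e-8)).detach()
    return (weight * td_error.pow(2)).mean()
```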
(3) A proximal policy optimization algorithm with prioritized trajectory replay (PTR-PPO). Existing experience replay methods assess sample value using one-step experiences, whereas one advantage of policy-gradient methods is that multi-step experience can improve the efficiency of policy improvement; how to assess the value of multi-step experience and make efficient use of historical experience therefore remains an open problem. We first design three trajectory priorities based on the characteristics of trajectories: max and mean priorities computed from one-step generalized advantage estimation (GAE) values, and a reward priority based on the normalized undiscounted cumulative reward (a minimal sketch of these priorities appears after this abstract). We then incorporate prioritized trajectory replay into the PPO algorithm, propose a truncated importance-weight method to overcome the high variance caused by large importance weights under multi-step experience, and design a policy-improvement loss function for PPO under off-policy conditions. Experiments were carried out in the Atari environments Atlantis-v0, Bowling-v0, Breakout-v0, NameThisGame-v0, Qbert-v0 and UpNDown-v0. The results show that PTR-PPO with any of the trajectory priorities outperforms vanilla PPO for the same amount of experience data, with clear advantages in Atlantis-v0, Bowling-v0, NameThisGame-v0, Qbert-v0 and UpNDown-v0 and an even more pronounced advantage in Breakout-v0. The analysis also shows that PTR-PPO affects policy performance directly by shaping the weight distribution of trajectories in the priority memory, and that changes in this distribution reflect the creeping and plateauing phases of policy performance.

(4) An intelligent learning framework for combat operations policy in wargames. Existing wargame platforms lack a consistent model of the combat operations planning process and a decision framework that supports intelligent planning, and the data throughput of a single machine can hardly support intelligent learning of combat operations policy. The proposed framework models the action planning process as a Markov decision process, defines the state, action and reward elements of operational tasks, and describes the action planning process in terms of deep reinforcement learning. On top of the wargame platform, a multi-computer parallel game-confrontation framework with a Python interface is built to support multi-entity, multi-role participation, and an intelligent learning framework for operational policy is designed that uses a master-slave learning architecture. Finally, a case study is conducted on a machine-versus-machine confrontation scenario from the National wargame competition. For the training architecture, the multi-computer parallel adversarial framework is used for multi-role, multi-agent confrontation to enlarge the volume of training data. For the algorithms, the predictive coding representation is used to encode high-dimensional, incomplete situational information, enhancing feature representation while reducing network parameters and the model's dependence on data; data utilisation and policy-improvement efficiency are further increased with the temporal-difference-sensitive sample adaptation and the proximal policy optimization algorithm with prioritized trajectory replay. The experimental results show that the sample-adaptive algorithm based on predictive coding, MDN-AF, achieves the highest score ranking with an average win rate of 80%, and 62.5% of its games achieve the operational goal; the algorithm learns four long-horizon strategies: autonomous wave division, supplementary reconnaissance, a "snake" strike strategy, and a bomber rear raid.
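As referenced in contribution (3) above, the following sketch illustrates one plausible reading of the three trajectory priorities (max and mean of one-step GAE magnitudes, and normalised undiscounted return) together with a truncated importance weight; the concrete formulas are assumptions based on the abstract, not the dissertation's implementation.

```python
# Sketch of trajectory-priority computation and importance-weight truncation for a
# PTR-PPO-style replay buffer; the specific formulas are assumptions.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # One-step TD residuals accumulated into generalized advantage estimates,
    # assuming the trajectory terminates (bootstrap value 0 at the end).
    deltas = rewards + gamma * np.append(values[1:], 0.0) - values
    adv, running = np.zeros_like(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def trajectory_priorities(rewards, values):
    adv = np.abs(gae_advantages(rewards, values))
    return {
        "max":    adv.max(),      # max-GAE priority
        "mean":   adv.mean(),     # mean-GAE priority
        "reward": rewards.sum(),  # undiscounted return, normalised across the buffer
    }

def truncated_importance_weight(pi_new, pi_old, clip=2.0):
    # Truncate large per-step importance ratios to limit variance when reusing
    # multi-step experience off-policy.
    return np.minimum(pi_new / (pi_old + 1e-8), clip)
```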
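For contribution (4), the abstract describes modelling wargame action planning as a Markov decision process with explicit state, action and reward elements exposed through a Python interface. The skeleton below is a hypothetical illustration of such an interface; every name and field in it is an assumption, not part of the dissertation's framework.

```python
# Hypothetical Gym-style skeleton of a wargame MDP; names and fields are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class WargameState:
    own_units: np.ndarray       # positions and status of friendly entities
    observed_enemy: np.ndarray  # partially observed enemy information
    terrain: np.ndarray         # static map features

class WargameEnv:
    def reset(self) -> WargameState:
        """Start a new scenario and return the initial state."""
        raise NotImplementedError

    def step(self, action: int):
        """Apply one planning action and return (next_state, reward, done, info);
        the reward would combine operational-goal progress and attrition terms."""
        raise NotImplementedError
```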
Keywords/Search Tags: action planning, reinforcement learning, intelligent game, wargame