
Q-learning Potential-Based Reward Online Learning Inspired by Prior Knowledge

Posted on: 2023-12-30    Degree: Master    Type: Thesis
Country: China    Candidate: X X Chen    Full Text: PDF
GTID: 2568307169480934    Subject: Management Science and Engineering
Abstract/Summary:
In recent years, reinforcement learning has developed rapidly, with applications spanning game AI[1][2], UAV swarm control[3], natural language processing[4], and more. Alongside this progress, the field faces many challenges. Reward sparsity is one of the main ones: it is common in practical problems such as robot control[5] and autonomous driving[6], and it hinders the application of reinforcement learning. The reward signal describes the agent's goal and is a key factor in training reinforcement learning algorithms. Under sparse rewards, the agent can rarely obtain the goal signal through random actions and must execute a long sequence of actions before receiving any environmental reward. On the one hand, sparsity makes it hard for the agent to identify, within long action sequences, the key actions that help solve the task; on the other hand, the long delay between reward signals slows training and can even prevent convergence.

Reward design is an important approach to the sparsity problem and a research hotspot in reinforcement learning, and researchers hope that effective reward design methods will advance its solution. Potential-based reward shaping is one of the main reward design methods: a reward function shaped this way can accelerate training and reduce the time the agent needs to converge to the optimal policy. In practical engineering applications, however, potential-based reward shaping is complex work. A numerical state potential function must first be defined, and without prior knowledge of a suitable potential function, designers often set it according to personal intuition. Potential functions designed this way are often inadequate and can even mislead the agent, trapping the designer in a repetitive cycle of training, poor results, fine-tuning the reward function, and retraining. At the same time, the application environments of reinforcement learning often contain abundant high-level human prior knowledge. How to use this prior knowledge to heuristically design the state potential function and reduce the designer's workload is the focus of this thesis. The main contents are as follows:

1. An online reward shaping method inspired by prior knowledge. A prior-knowledge heuristic extracts human goals from expert knowledge as subgoals for reward shaping, and these subgoals serve as the key nodes for state aggregation. An upper-level abstract Markov decision process (MDP) is built over the aggregated states and solved with an online reinforcement learning algorithm to obtain an abstract value function, from which the reward function is then constructed online. This provides the agent with prior-knowledge-inspired reward values at the key nodes and improves its exploration efficiency.

2. A learning framework combining the proposed online reward shaping method with offline reinforcement learning, used to study the method's effect in single-agent reinforcement learning. The first layer of the framework learns the potential function and reward function online by solving the abstract-state MDP model; the second layer uses the concrete states and the reward function provided by the upper layer to construct the concrete MDP model and solves for the optimal policy with an offline reinforcement learning algorithm. To verify the framework's effectiveness against reward sparsity, this study applies the framework, combining online reward shaping with the DQN algorithm, to a path-finding experiment in a maze environment. The experiments show that the framework effectively improves the algorithm's exploration efficiency.

3. A learning framework combining the prior-knowledge-inspired online reward shaping method with multi-agent reinforcement learning (MARL), used to study the method's effect in multi-agent systems (MAS). Targeting reward sparsity in MAS, this study combines the online reward shaping algorithm with QMIX, a classic MARL algorithm. Online reward shaping promotes reasonable long-term credit assignment over rewards, densifies the reward signal, and alleviates the sparse-reward problem. Through decentralized decision-making and centralized training, the spatial distribution of the reward function over the agents is learned, guiding each agent to decide from local observations while the agents cooperate toward the globally Pareto-optimal solution.

4. Application research on the online reward shaping method in military intelligent gaming, targeting sparse rewards in wargame deduction and based on the deduction platform of the national wargame challenge competition. On this platform, the study designs a sea-air cooperative combat wargame environment for experimental verification. In the experiments, a red side controlled by the QMIX-KRS algorithm, a red side controlled by plain QMIX, and a blue side controlled by a traditional rule-based algorithm fight against one another. After 1000 rounds of training, the red side controlled by QMIX-KRS learns a good multi-agent cooperation strategy and achieves a 70% win rate against the blue side, whereas plain QMIX learns poorly in the sparse-reward sea-air cooperative combat environment. The comparative experiments show that the proposed framework can guide agents to attend to the key battle nodes extracted from prior knowledge in the multi-agent reinforcement learning setting and improve the exploration efficiency of the algorithm.
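The thesis's core idea, using an abstract value function as the potential for reward shaping, can be sketched with the standard potential-based shaping term F(s, a, s') = γΦ(s') − Φ(s). The following minimal Python sketch is illustrative only: the subgoal names, the `abstract_value` table, and the `mapping` function are hypothetical stand-ins for the aggregated states and abstract value function described above, not code from the thesis.

```python
# Potential-based reward shaping sketch: F(s, a, s') = GAMMA * Phi(s') - Phi(s).
# Here the potential Phi of a concrete state is the value of its abstract
# (subgoal) state, as in the upper-layer abstract MDP described above.
# All names below are illustrative assumptions.

GAMMA = 0.99

# Hypothetical abstract value function over aggregated subgoal states,
# e.g. obtained by solving the abstract MDP online.
abstract_value = {"start": 0.0, "door": 0.5, "key": 0.8, "goal": 1.0}

def phi(state, mapping):
    """Potential of a concrete state = value of its aggregated abstract state."""
    return abstract_value[mapping(state)]

def shaped_reward(r_env, s, s_next, mapping):
    """Environment reward plus the potential-based shaping term."""
    return r_env + GAMMA * phi(s_next, mapping) - phi(s, mapping)
```

Because the shaping term telescopes along any trajectory, it changes the return of every policy by the same state-dependent constant, so the optimal policy of the original MDP is preserved while progress between subgoals is rewarded immediately.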
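The upper layer's online solution of the abstract MDP can likewise be sketched with tabular Q-learning over subgoal states. The toy dynamics, state and action names, and hyperparameters below are all assumptions for illustration; the thesis's actual abstract MDP and learning algorithm may differ.

```python
import random

# Illustrative sketch (not the thesis's code): tabular Q-learning on a tiny
# abstract MDP whose states are subgoals. The learned value max_a Q(s, a)
# could then serve as the potential Phi(s) for online reward shaping.
random.seed(0)

STATES = ["start", "door", "goal"]   # aggregated subgoal states (assumed)
ACTIONS = ["advance", "stay"]        # abstract actions (assumed)
ALPHA, GAMMA, EPS, EPISODES = 0.5, 0.9, 0.1, 200

def step(s, a):
    """Deterministic toy dynamics: 'advance' moves one subgoal forward."""
    nxt = STATES[min(STATES.index(s) + 1, 2)] if a == "advance" else s
    return nxt, (1.0 if nxt == "goal" else 0.0), nxt == "goal"

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(EPISODES):
    s, done = "start", False
    while not done:
        # Epsilon-greedy action selection on the abstract MDP.
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# Abstract value function to be used as the shaping potential.
phi = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
```

After training, `phi` assigns increasing potential along the subgoal chain (with the terminal state conventionally at zero), which is exactly the shape needed to densify a sparse terminal reward.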
Keywords/Search Tags: Reinforcement learning, Reward shaping, Prior knowledge, Abstract MDP model, Military intelligence game