In recent years, deep neural networks have developed rapidly, providing strong support for applying reinforcement learning to complex tasks. Deep reinforcement learning, which uses a deep neural network as the agent's policy, usually requires a large number of experiences containing reward signals in order to improve that policy through training. However, in most deep reinforcement learning tasks, the rewards provided by the environment are too sparse relative to the state space of the task, so it is difficult for agents to collect enough reward signals to support their training, and deep reinforcement learning algorithms consequently perform poorly on complex tasks. Aiming at this sparse reward problem, this thesis studies the design of intrinsic rewards that provide dense reward signals for agents, so as to improve their exploration efficiency and policy performance. The thesis focuses on two aspects, reward shaping and credit assignment, and puts forward the following two methods.

(1) A design of intrinsic rewards based on pseudo reward and action importance classification. Real-world tasks usually have large state and action spaces, and intrinsic rewards defined by expert knowledge in such tasks are often noisy. To address this, the thesis proposes a design of intrinsic rewards based on pseudo reward and action importance classification. The pseudo reward network transforms the relationship between intrinsic rewards and states into pseudo reward knowledge and transmits this knowledge to the agent, helping it overcome the noise introduced by the intrinsic rewards. The action importance classification component uses expert knowledge to compute the relationship between actions and states and passes the resulting action importance knowledge to the agent, helping it make more valuable action decisions. Because the method fully extracts useful information from expert knowledge, it effectively improves training speed and performance while lowering the barrier to using hand-designed intrinsic rewards.

(2) A design of intrinsic rewards based on credit assignment. Because hand-designed intrinsic reward methods are difficult to apply widely, the thesis proposes a design of intrinsic rewards based on credit assignment. This method uses an intrinsic reward prediction network to learn the relationship between the agent's state-action pairs and intrinsic rewards, providing a corresponding intrinsic reward for every state-action pair. To improve the prediction accuracy of this network, the method distributes extrinsic rewards over the agent's state-action pairs via credit assignment, which supplies the supervised signals required to train the intrinsic reward prediction network. The core of the method is a self-supervised learning network that can be trained iteratively alongside the agent's main task and that provides the agent with a large number of reward signals to promote rapid improvement of its policy, as sketched below.
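To make the second method more concrete, the following is a minimal sketch of how such an intrinsic reward prediction network could be trained on credit-assigned extrinsic rewards. It is not the thesis's implementation: the names IntrinsicRewardNet, credit_assigned_targets, and update_intrinsic_net are hypothetical, PyTorch is assumed, and the uniform redistribution of the episode return over the visited state-action pairs is only one simple credit-assignment rule among many.

# Hypothetical sketch: intrinsic reward prediction trained with
# credit-assigned targets. Names and the uniform credit-assignment rule
# are illustrative assumptions, not the thesis's exact design.
import torch
import torch.nn as nn


class IntrinsicRewardNet(nn.Module):
    """Predicts a scalar intrinsic reward for a state-action pair."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def credit_assigned_targets(extrinsic_rewards: torch.Tensor) -> torch.Tensor:
    """Distribute the sparse episode return uniformly over all steps.

    extrinsic_rewards: shape (T,), mostly zeros with a few sparse rewards.
    Returns per-step supervised targets of the same shape.
    """
    return extrinsic_rewards.mean() * torch.ones_like(extrinsic_rewards)


def update_intrinsic_net(net, optimizer, states, actions, extrinsic_rewards):
    """One self-supervised update: regress predictions onto the assigned credit."""
    targets = credit_assigned_targets(extrinsic_rewards)
    predictions = net(states, actions)
    loss = nn.functional.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In an actual agent, the predicted intrinsic rewards would presumably be added to the sparse extrinsic rewards when computing policy or value targets, so that the policy and the intrinsic reward prediction network can be trained iteratively in the way the abstract describes.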
The thesis mainly experiments with the Multi-Agent Particle Environment and the Google Research Football simulation to verify the effectiveness of the proposed methods, and the experimental results show that the methods not only accelerate the training of deep reinforcement learning algorithms but also improve their final performance.