This thesis studies intrinsic reward optimization in reinforcement learning, with autonomous navigation as the application background. Autonomous navigation technology is in strong demand in many fields: navigation with high autonomy and little dependence on external equipment meets the needs of many specific mission scenarios. The more mature current navigation technologies rely either on precision instruments and hand-designed response programs, or on rich prior knowledge of the environment. Such methods can neither explore an unfamiliar environment autonomously without a pre-existing algorithm, nor accumulate experience when knowledge of the environment is lacking. Autonomous navigation therefore leaves considerable room and need for research.

Reinforcement learning is a machine learning method that uses interaction data between an agent and its environment to learn a task-completion policy. In reinforcement learning based autonomous navigation tasks, the agent generally needs neither sophisticated positioning instruments nor rich prior knowledge of the environment. Instead, intrinsic rewards are set up to drive exploration, the agent builds an awareness of the environment through that exploration, and this awareness is then used to complete the navigation task. Compared with traditional navigation methods, reinforcement learning based autonomous navigation has stronger autonomy and depends less on external devices.

In the actual navigation process, however, the current observation may correspond to multiple training samples taken at different positions. Similar appearance leads to similar intrinsic rewards, which weakens the agent's motivation to explore distant but similar-looking areas and thus lowers exploration efficiency. Moreover, training samples with similar appearance do not necessarily share the same correct action in the navigation task. The resulting ambiguity in the observation-to-action mapping pushes the policy under such observations toward randomness: the agent wanders aimlessly, and policy training ultimately fails.

This thesis focuses on the stagnant exploration and policy failure that reinforcement learning based autonomous navigation algorithms may suffer in environments with similar structures and repeated appearances, and proposes two intrinsic reward optimization methods. The main contents are as follows.

First, an intrinsic reward model based on a recurrent neural network is proposed, which uses preceding observations to distinguish the current observation from similar ones. The existing intrinsic reward model for reinforcement learning is structurally improved: a recurrent neural network incorporates the history of observations so that visually similar observations can be told apart. For the intrinsic reward function of ICM, the uncertainty of the inverse model is analyzed, and this uncertainty is removed by adding position information to the inverse model. Experimental results show that this method lets the policy converge stably and achieve high performance.

Second, a reward weighting method is proposed. Intrinsic reward networks are trained with the ICM and RND models, and the policy network is trained on a weighted combination of intrinsic and extrinsic rewards, strengthening the agent's drive to complete the task. Experimental results show that reward weighting maintains a high exploration capability while accelerating policy convergence.
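The reward weighting idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not the thesis's actual implementation: the curiosity scale `eta`, the weights `w_ext` and `w_int`, and the small feature vectors standing in for the ICM forward model's outputs are all hypothetical values chosen for the example.

```python
import numpy as np

def icm_intrinsic_reward(phi_next_pred, phi_next, eta=0.5):
    """ICM-style curiosity bonus: scaled squared error between the
    forward model's predicted next-state feature and the true one.
    Visually similar observations yield similar features, hence
    similar bonuses -- the failure mode discussed in the abstract."""
    return eta * 0.5 * np.sum((phi_next_pred - phi_next) ** 2)

def weighted_reward(r_ext, r_int, w_ext=1.0, w_int=0.1):
    """Combine the extrinsic (task) reward and the intrinsic
    (curiosity) reward with fixed weights before the combined
    signal is fed to the policy update."""
    return w_ext * r_ext + w_int * r_int

# Hypothetical next-state feature and the forward model's prediction.
phi_true = np.array([1.0, 0.0])
phi_pred = np.array([0.8, 0.1])

r_int = icm_intrinsic_reward(phi_pred, phi_true)   # 0.0125
r_total = weighted_reward(r_ext=1.0, r_int=r_int)  # 1.00125
```

In this sketch the weights are constants; scheduling `w_int` (e.g. decaying it as training progresses) is one common design choice when balancing exploration against convergence speed.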