
Research On Optimization Methods Of The Experience Replay Mechanism For Off-policy Reinforcement Learning

Posted on: 2021-03-14 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Cao | Full Text: PDF
GTID: 2428330614970790 | Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning has gradually become a research hotspot in artificial intelligence in recent years, achieving great success in games, control, natural language processing, and other fields. As reinforcement learning tasks grow increasingly complex, the experience replay mechanism was proposed to improve data utilization. Prior experience replay methods mostly use uniform random sampling, so every experience is replayed at the same frequency, which hinders effective use of past experiences. To address this problem, a new algorithm called High-Value Prioritized Experience Replay (HVPER) for off-policy reinforcement learning is presented, which aims to improve the learning efficiency of reinforcement learning algorithms.

This thesis first proposes a prioritized experience replay method based on temporal-difference (TD) errors and rewards for sparse-reward environments. When designing the priority of experiences, the algorithm combines a reward-based priority with the Prioritized Experience Replay (PER) algorithm. The proposed approach is tested in the Blind Cliffwalk and Gym environments. Experimental results verify that, compared with the Deep Q-Network (DQN) algorithm and the PER algorithm, combining TD-error priority with reward priority improves the training speed of reinforcement learning in tasks with sparse rewards.

The algorithm is then extended to general environments: an experience replay method based on TD errors and state-action values is proposed for reinforcement learning tasks without sparse rewards. A set of comparative experiments in Gym environments evaluates the HVPER algorithm. Compared with the Deep Deterministic Policy Gradient (DDPG) algorithm, HVPER accelerates network training and obtains better performance in tasks with continuous action spaces. Intuitively, the reward or the state-action value represents the immediate or long-term return of a state-action pair, so prioritizing it helps the agent obtain high rewards, while the TD error indicates how well the model fits, so prioritizing it helps the model converge quickly. Combining the two indicators therefore allows reinforcement learning algorithms to converge quickly to a higher performance level.

Finally, to verify the effectiveness of the algorithm in practical applications, HVPER is applied to an autopilot task built on the X-Plane flight simulation engine. The autopilot task is designed in a non-sparse-reward environment to compare the HVPER algorithm with the DDPG algorithm. The experimental results show that HVPER achieves better performance on this practical task, improving both the training speed and the success rate of the autopilot.
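The abstract does not give the exact priority formula, so the following is only a minimal sketch of the idea of combining a TD-error priority with a value-based (reward) priority in a PER-style buffer. The weighting scheme (a hypothetical mix `p_i = lam * |delta_i| + (1 - lam) * reward_i + eps`), the class name `HVPERBuffer`, and all parameter values are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

class HVPERBuffer:
    """Sketch of a high-value prioritized replay buffer.

    Assumed priority (not from the thesis): a weighted mix of |TD error|
    and reward, sampled proportionally as in standard PER.
    """

    def __init__(self, capacity, lam=0.5, eps=1e-6, alpha=0.6):
        self.capacity = capacity
        self.lam = lam      # trade-off between TD-error priority and reward priority
        self.eps = eps      # small constant so every transition keeps a nonzero probability
        self.alpha = alpha  # PER exponent controlling how strongly priorities bias sampling
        self.data, self.priorities = [], []

    def add(self, transition, td_error, reward):
        # Combine the model-fit signal (|TD error|) with the value signal (reward).
        priority = self.lam * abs(td_error) + (1.0 - self.lam) * max(reward, 0.0) + self.eps
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # Proportional sampling: high-priority transitions are replayed more often.
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx
```

For the non-sparse-reward extension described above, the reward term would be replaced by an estimated state-action value Q(s, a); the sampling logic stays the same.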
Keywords/Search Tags:Deep Reinforcement Learning, Experience Replay, High-Value, Temporal Difference Error, Autopilot