
Research On Optimization Methods Of The Experience Replay Mechanism For Off-policy Reinforcement Learning

Posted on: 2021-03-14 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Cao | Full Text: PDF
GTID: 2428330614970790 | Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning has gradually become a research hotspot in artificial intelligence in recent years, achieving great success in games, control, natural language processing, and other fields. As reinforcement learning tasks grow increasingly complex, the experience replay mechanism was proposed to improve data utilization. Prior experience replay methods mostly use uniform random sampling, so every experience is replayed at the same frequency, which hinders effective use of past experiences. To address this problem, a new algorithm called High-Value Prioritized Experience Replay (HVPER) for off-policy reinforcement learning is presented, which aims to improve the learning efficiency of reinforcement learning algorithms.

This thesis first proposes a prioritized experience replay method based on temporal-difference (TD) errors and rewards for sparse-reward environments. When designing the priority of experiences, the algorithm combines a reward-based priority with the Prioritized Experience Replay (PER) algorithm. The proposed approach is tested in the Blind Cliffwalk and Gym environments. Experimental results verify that, compared with the Deep Q-Network (DQN) algorithm and the PER algorithm, combining TD-error priority with reward priority improves the training speed of reinforcement learning in tasks with sparse rewards.

The algorithm is then extended to general environments: an experience replay method based on TD errors and state-action values is proposed for reinforcement learning tasks without sparse rewards. A set of comparative experiments in Gym environments evaluates the HVPER algorithm. Compared with the Deep Deterministic Policy Gradient (DDPG) algorithm, HVPER accelerates network training and obtains better performance in tasks with continuous action spaces. Intuitively, the reward or the state-action value represents the immediate or long-term return of a state-action pair, so prioritizing it helps the agent obtain high rewards, while the TD error indicates how well the model fits, so prioritizing it helps the model converge quickly. Combining the two indicators therefore allows reinforcement learning algorithms to converge quickly to a higher performance level.

Finally, to verify the effectiveness of the algorithm in practical applications, HVPER is applied to an autopilot task built on the X-Plane flight simulation engine. The autopilot task is designed in a non-sparse-reward environment to compare the HVPER algorithm with the DDPG algorithm. The experimental results show that HVPER achieves better performance on this practical task, improving both the training speed and the success rate of the autopilot.
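The abstract does not give the exact priority formula, so the following is only a minimal sketch of the idea of combining a TD-error priority with a value-based (reward) priority in a PER-style buffer. The weighting scheme (a hypothetical mix `p_i = lam * |delta_i| + (1 - lam) * reward_i + eps`), the class name `HVPERBuffer`, and all parameter values are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

class HVPERBuffer:
    """Sketch of a high-value prioritized replay buffer.

    Assumed priority (not from the thesis): a weighted mix of |TD error|
    and reward, sampled proportionally as in standard PER.
    """

    def __init__(self, capacity, lam=0.5, eps=1e-6, alpha=0.6):
        self.capacity = capacity
        self.lam = lam      # trade-off between TD-error priority and reward priority
        self.eps = eps      # small constant so every transition keeps a nonzero probability
        self.alpha = alpha  # PER exponent controlling how strongly priorities bias sampling
        self.data, self.priorities = [], []

    def add(self, transition, td_error, reward):
        # Combine the model-fit signal (|TD error|) with the value signal (reward).
        priority = self.lam * abs(td_error) + (1.0 - self.lam) * max(reward, 0.0) + self.eps
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # Proportional sampling: high-priority transitions are replayed more often.
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx
```

For the non-sparse-reward extension described above, the reward term would be replaced by an estimated state-action value Q(s, a); the sampling logic stays the same.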
Keywords/Search Tags:Deep Reinforcement Learning, Experience Replay, High-Value, Temporal Difference Error, Autopilot