
Research On Experience Replay Method For Deep Reinforcement Learning

Posted on: 2021-04-22 | Degree: Master | Type: Thesis
Country: China | Candidate: S M Shi | Full Text: PDF
GTID: 2428330605474870 | Subject: Computer technology
Abstract/Summary:
Deep reinforcement learning, which combines deep learning with reinforcement learning, has shown strong generality and achieved breakthrough progress on complex decision-making tasks. It uses deep neural networks to build network models and couples them with the decision-making capability of reinforcement learning, so that an agent can learn an execution policy from the reward signals of the environment without any prior knowledge. The experience replay method breaks the temporal correlation between transitions during network training and allows transitions to be reused, improving data utilization. However, how transitions are stored and selected has a significant impact on network training. To further improve the performance of experience replay in deep reinforcement learning, this thesis studies and improves it from the following three aspects:

i. Prioritized experience replay based on the temporal-difference error (TD-error) measures the importance of each transition by the absolute value of its TD-error. However, changes in the neural network parameters alter the TD-errors of transitions already stored in the replay buffer, which introduces bias into the sampling process. To address this problem, the reward of a transition is used as the priority measure, and the sampling probability is adjusted according to the magnitude of the reward; a deep deterministic policy gradient algorithm based on reward-prioritized sampling is proposed (see the first sketch below). A series of experiments in the MuJoCo environment demonstrates the superiority of the algorithm.

ii. The experience replay buffer stores transitions in first-in-first-out order. Whenever the buffer changes, prioritized replay must assign priorities to newly added transitions and update the priorities of existing ones, and sampling must then follow these priorities, which increases the time complexity of the algorithm to a certain extent. To further improve efficiency, a deep deterministic policy gradient method with classified experience replay is proposed, in which transitions are grouped into classes according to their TD-error or reward value (see the second sketch below). The superiority of this algorithm is likewise verified on tasks with continuous state-action spaces.

iii. In experience replay, the transition generated by the interaction between the agent and the environment is stored in the replay buffer at every time step, and once enough transitions have accumulated, mini-batches are drawn for network training. To reduce the redundancy of transitions in the buffer and enrich the data obtained in each mini-batch, multiple agents are set up to interact with the environment and generate transitions, so that network training has more sufficient and diverse data (see the third sketch below). This thesis proposes a deep Q-network method with multi-agent sampling and verifies its superiority on a series of tasks with discrete state-action spaces.
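The first contribution replaces TD-error-based priorities with reward-based ones. Below is a minimal sketch of such a buffer, assuming priorities are derived from the absolute reward; the class name and the hyper-parameters alpha and eps are illustrative and are not taken from the thesis.

```python
# Sketch: replay buffer whose sampling probability grows with |reward|.
# alpha controls how strongly priorities skew sampling; eps keeps
# zero-reward transitions sampleable. Both values are assumptions.
import numpy as np

class RewardPriorityBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.data = []          # stored transitions
        self.priorities = []    # one priority per transition
        self.pos = 0            # next write position (FIFO overwrite)

    def add(self, state, action, reward, next_state, done):
        priority = (abs(reward) + self.eps) ** self.alpha
        transition = (state, action, reward, next_state, done)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx]
```

Because the reward of a stored transition never changes, these priorities do not need to be recomputed as the network parameters change, which is the motivation given in the abstract.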
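The second contribution groups transitions into classes instead of maintaining a per-transition priority. The sketch below assumes a two-class split by reward value and a fixed mixing ratio per mini-batch; the threshold, class count, and ratio are illustrative assumptions, not the thesis's exact scheme.

```python
# Sketch: classified experience replay with two reward classes.
# Sampling is uniform within each class, so no priority updates are needed.
import random
from collections import deque

class ClassifiedReplayBuffer:
    def __init__(self, capacity_per_class=50_000, reward_threshold=0.0):
        self.threshold = reward_threshold
        self.high = deque(maxlen=capacity_per_class)   # reward > threshold
        self.low = deque(maxlen=capacity_per_class)    # reward <= threshold

    def add(self, transition):
        # transition is assumed to be (state, action, reward, next_state, done)
        _, _, reward, _, _ = transition
        (self.high if reward > self.threshold else self.low).append(transition)

    def sample(self, batch_size, high_fraction=0.5):
        n_high = min(int(batch_size * high_fraction), len(self.high))
        n_low = min(batch_size - n_high, len(self.low))
        return (random.sample(list(self.high), n_high) +
                random.sample(list(self.low), n_low))
```

Drawing uniformly within each class avoids re-sorting or re-weighting the whole buffer after every insertion, which is the efficiency argument made in the abstract.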
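The third contribution has several agents fill one shared replay buffer. The sketch below illustrates the data-collection loop only, with a toy environment and a random stand-in for the epsilon-greedy DQN policy; all names here are placeholders.

```python
# Sketch: multi-agent sampling into a shared replay buffer.
# Each agent contributes one transition per time step.
import random

class ToyEnv:
    """Placeholder environment with a 1-D state and two actions."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += 1 if action == 1 else -1
        reward = 1.0 if self.state == 5 else 0.0
        done = abs(self.state) >= 5
        return self.state, reward, done

def collect_step(envs, states, buffer):
    """Every agent acts once; all resulting transitions go into one buffer."""
    for i, env in enumerate(envs):
        action = random.choice([0, 1])        # stand-in for an eps-greedy policy
        next_state, reward, done = env.step(action)
        buffer.append((states[i], action, reward, next_state, done))
        states[i] = env.reset() if done else next_state
    return states

envs = [ToyEnv() for _ in range(4)]           # four parallel agents
states = [env.reset() for env in envs]
buffer = []
for _ in range(100):
    states = collect_step(envs, states, buffer)
print(len(buffer))                            # 400 transitions collected
```

With several agents exploring independently, each mini-batch drawn from the shared buffer mixes transitions from different trajectories, which is the diversity benefit described in the abstract.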
Keywords/Search Tags:Deep reinforcement learning, experience replay, priority experience replay, classification experience replay, multi-agent sampling