
Research On Sample Generation And Selection Methods For Deep Reinforcement Learning

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: T Yang
Full Text: PDF
GTID: 2518306527970309
Subject: Computer Science and Technology
Abstract/Summary:
Deep reinforcement learning is an important branch of artificial intelligence research for sequential decision-making problems. It learns an optimal policy from the samples generated as the agent interacts with the environment. Because the learning process relies on samples produced by a large number of agent-environment interactions, deep reinforcement learning algorithms are limited in applications where samples are expensive to acquire. Different behavior policies generate different samples, and the choice of samples in turn affects the learned policy. To improve the sample efficiency of deep reinforcement learning, reduce the number of agent-environment interactions, and obtain a high-quality policy, this thesis completes the following work:

(1) An adaptive ε-greedy policy based on the average episodic cumulative reward (AECR-DQN) is proposed. The ε-greedy policy, commonly used in deep reinforcement learning, is a random exploration policy for sample generation. It ignores other factors that affect the agent's decision-making and is therefore somewhat blind. This thesis instead uses the episodic cumulative reward the agent receives after completing a task to guide the agent in choosing between exploration and exploitation (sketched in code after the abstract). Experimental results show that a deep Q-network with the adaptive ε-greedy policy based on the average episodic cumulative reward generates samples that are more conducive to learning the optimal policy and obtains higher rewards.

(2) In contrast to traditional deep reinforcement learning, which samples one-step transitions uniformly at random from the experience replay memory, a method is proposed that generates and selects whole episodes as training samples. First, a method for generating episode samples based on a genetic crossover operator (GCO-DQN) is proposed, in which a similar state shared by two episodes serves as the crossover point to synthesize episodes that have never actually occurred, increasing the number and diversity of episodes. Building on this expanded set of episodes, a method for selecting episodes based on a genetic selection operator (GSCO-DQN) is proposed, using the cumulative reward of an episode as the criterion of its importance (see the second sketch after the abstract). The method preserves episode diversity while increasing the sampling probability of important episodes. Experimental results show that generating and selecting deep Q-network samples with genetic operators reduces the number of agent-environment interactions, improves sample utilization, and yields a policy with higher rewards.

(3) Combining AECR-DQN and GSCO-DQN, a sample generation and selection method based on genetic operators and the adaptive ε-greedy policy (AECR-GSCO-DQN) is proposed. The adaptive ε-greedy policy first generates samples in a more targeted manner; the genetic crossover operator is then applied to these samples to obtain more diverse ones, and the genetic selection operator finally picks the samples most conducive to learning the best policy. Experimental results show that, compared with GSCO-DQN, AECR-GSCO-DQN achieves a higher average reward and improves the quality of the learned policy.
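To make the idea in (1) concrete, the following is a minimal Python sketch of an adaptive ε-greedy schedule driven by the average episodic cumulative reward. The abstract does not give the exact update rule, so the class name, the fixed ±step adjustment, and the [eps_min, eps_max] bounds are illustrative assumptions, not the author's actual method.

```python
import random
import numpy as np


class AdaptiveEpsilonGreedy:
    """Hypothetical sketch of the AECR idea in (1): epsilon adapts by comparing
    the latest episode's cumulative reward with the running average of all
    finished episodes. The exact rule in the thesis is not specified here."""

    def __init__(self, eps_min=0.05, eps_max=1.0, step=0.05):
        self.eps = eps_max
        self.eps_min, self.eps_max, self.step = eps_min, eps_max, step
        self.episode_returns = []          # cumulative reward of each finished episode

    def end_episode(self, episode_return):
        """Call once per finished episode with its cumulative reward."""
        self.episode_returns.append(episode_return)
        avg = np.mean(self.episode_returns)
        if episode_return >= avg:          # better than average: exploit more
            self.eps = max(self.eps_min, self.eps - self.step)
        else:                              # worse than average: explore more
            self.eps = min(self.eps_max, self.eps + self.step)

    def select_action(self, q_values):
        """Standard epsilon-greedy action selection over a vector of Q-values."""
        if random.random() < self.eps:
            return random.randrange(len(q_values))
        return int(np.argmax(q_values))
```

In a DQN training loop one would call select_action on the Q-network output at every step and end_episode with the total reward when an episode terminates, so that ε shrinks while the agent outperforms its own running average and grows again when performance drops.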
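The episode-level genetic operators in (2) can be sketched in the same spirit. The state-similarity test, the fitness shift, and the function names below are assumptions made for illustration; the abstract only specifies that a shared similar state serves as the crossover point and that an episode's cumulative reward measures its importance.

```python
import random
import numpy as np

# An episode is a list of transitions (state, action, reward, next_state, done).

def similar(s1, s2, tol=1e-3):
    """Assumed similarity test between two states (small Euclidean distance)."""
    return np.linalg.norm(np.asarray(s1) - np.asarray(s2)) < tol

def crossover_episodes(ep_a, ep_b, tol=1e-3):
    """GCO idea from (2): splice the prefix of ep_a onto the suffix of ep_b
    at a pair of similar states, producing an episode never actually run."""
    for i, transition_a in enumerate(ep_a):
        for j, transition_b in enumerate(ep_b):
            if similar(transition_a[0], transition_b[0], tol):
                return ep_a[:i] + ep_b[j:]
    return None                            # no similar state found, no offspring

def select_episodes(episodes, k):
    """GSCO idea from (2): fitness-proportional sampling where the fitness of
    an episode is its cumulative reward (shifted to stay positive)."""
    returns = np.array([sum(t[2] for t in ep) for ep in episodes], dtype=float)
    fitness = returns - returns.min() + 1e-6
    probs = fitness / fitness.sum()
    idx = np.random.choice(len(episodes), size=k, replace=True, p=probs)
    return [episodes[i] for i in idx]
```

Offspring episodes produced by crossover_episodes can be added back to the experience replay memory, and select_episodes then biases training toward high-return episodes without entirely discarding diverse, lower-return ones.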
Keywords/Search Tags:Deep reinforcement learning, sample efficiency, episodic cumulative reward, experience replay memory, genetic algorithm