
Feature Extraction In Deep Reinforcement Learning And Countermeasures For Sparse Reward

Posted on: 2024-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: X C Wu
Full Text: PDF
GTID: 2568307106982159
Subject: Electronic information
Abstract/Summary:
In recent years, deep reinforcement learning has achieved remarkable success in fields such as Go, autonomous driving, and traffic signal control. However, the decision-making of agents is affected by several critical factors. First, the quality of feature extraction influences both decision quality and training speed. Second, in scenarios where opponent information is hidden, agents cannot take historical actions into account the way humans do when making decisions. Third, sparse rewards can make learning slow or prevent it entirely. This thesis investigates these issues in the context of the card game Dou Dizhu. The specific contributions address the problems above:

(1) To address the low win rate when opponent information is unknown, which in Dou Dizhu manifests as slow learning and disregard of historical actions when playing cards, this thesis improves the Double Deep Q-Learning algorithm. A binary encoding scheme simplifies feature extraction and accelerates learning, and a Gated Recurrent Unit (GRU) extracts features from the historical action sequence, which are combined with the current state features to support sound decisions. Experimental results show that the improved Double Deep Q-Learning algorithm trains faster and achieves higher win rates.

(2) To address sparse rewards, this thesis combines inverse reinforcement learning with the Double Deep Q-Learning algorithm. Human expert trajectories are sampled and a reward function is defined; the Double Deep Q-Learning algorithm uses this reward function to learn a policy in the environment, the learned policy is compared with the expert policy, and the reward function is iteratively updated so that the agent's policy approaches the expert policy. Experimental results show good performance in the early stages of training.

(3) To avoid the difficulty of sampling human expert trajectories in Work II and the influence of human strategies on the agent's policy, Work III uses the Upper Confidence Bounds applied to Trees (UCT) algorithm to generate a large number of samples and objectively estimate the reward of each situation. A reward-prediction agent with access to the information of all players is designed to learn these rewards, which then guide the policy learning of the agent from Work I. Experimental results demonstrate the effectiveness of the algorithm in predicting rewards and guiding agent learning.
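To make the Work (1) architecture concrete, the following is a minimal sketch (not the thesis code) of the core idea: a GRU encodes the sequence of historical actions, its final hidden state is concatenated with the binary-encoded current state, and an MLP head outputs Q-values for a Double DQN agent. All layer sizes, feature dimensions, and the action count below are assumptions for illustration.

    import torch
    import torch.nn as nn

    class GRUQNetwork(nn.Module):
        def __init__(self, state_dim=108, action_dim=54, hist_dim=54, hidden=128):
            super().__init__()
            self.gru = nn.GRU(input_size=hist_dim, hidden_size=hidden, batch_first=True)
            self.head = nn.Sequential(
                nn.Linear(state_dim + hidden, 256),
                nn.ReLU(),
                nn.Linear(256, action_dim),
            )

        def forward(self, state, history):
            # state:   (batch, state_dim)          binary-encoded current situation
            # history: (batch, seq_len, hist_dim)  binary-encoded past actions
            _, h_n = self.gru(history)             # h_n: (1, batch, hidden)
            hist_feat = h_n.squeeze(0)             # final hidden state as sequence feature
            return self.head(torch.cat([state, hist_feat], dim=1))

    def double_dqn_target(online, target, next_state, next_history, reward, done, gamma=0.99):
        # Double DQN: the online network selects the next action,
        # the target network evaluates it.
        with torch.no_grad():
            best_action = online(next_state, next_history).argmax(dim=1, keepdim=True)
            next_q = target(next_state, next_history).gather(1, best_action).squeeze(1)
            return reward + gamma * (1.0 - done) * next_q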
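For Work (2), one hedged way to realize the "compare agent policy with expert policy and iteratively update the reward function" loop is a learned reward network trained so that expert state-action pairs score higher than the current agent's. The sketch below is an assumption, not the thesis implementation; the feature dimension and the way expert and agent pairs are batched are placeholders.

    import torch
    import torch.nn as nn

    class RewardNet(nn.Module):
        def __init__(self, feat_dim=162):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, sa_feat):
            # sa_feat: (batch, feat_dim) concatenated state-action features
            return self.net(sa_feat).squeeze(1)

    def irl_iteration(reward_net, optimizer, expert_pairs, agent_pairs):
        # Logistic objective: push expert scores up and the agent's scores down,
        # standing in for the iterative reward-function update described above.
        r_exp = reward_net(expert_pairs)
        r_agt = reward_net(agent_pairs)
        loss = -(torch.log(torch.sigmoid(r_exp)).mean()
                 + torch.log(1.0 - torch.sigmoid(r_agt)).mean())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The Double Deep Q-Learning agent would then be retrained against the rewards produced by reward_net, and fresh agent pairs would be fed back into the next irl_iteration call.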
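For Work (3), the UCT algorithm chooses which child node to expand with the UCB1 rule; the thesis uses such tree-search rollouts to estimate situation rewards for the reward-prediction agent. A minimal sketch of the selection rule, with an assumed exploration constant:

    import math

    def ucb1(total_value, visits, parent_visits, c=1.4):
        # Unvisited children are explored first; otherwise balance the average
        # value (exploitation) against the visit-count bonus (exploration).
        if visits == 0:
            return float("inf")
        return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)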
Keywords/Search Tags: Deep reinforcement learning, Double deep Q-learning, Gated recurrent unit, Upper confidence bounds applied to trees (UCT), Sparse reward