
Research on Advancing Deep Reinforcement Learning

Posted on: 2019-01-31    Degree: Master    Type: Thesis
Country: China    Candidate: J M Xiao    Full Text: PDF
GTID: 2428330590975370    Subject: Computer technology
Abstract/Summary:
Reinforcement learning is a subfield of machine learning. It performs well on Markov decision process problems such as industrial control, board games, and autonomous driving. Deep neural network research has likewise become an important subfield of machine learning, after problems such as vanishing gradients were solved and the floating-point performance of computers increased. Combining the strengths of reinforcement learning and deep neural networks has therefore become a research hotspot. In DeepMind's work on the Deep Q Network (DQN), which was inspired by the Neural Fitted Q (NFQ) algorithm, a deep neural network is used as a function approximator for the Q-value table, and an experience replay mechanism is used to keep the training samples approximately independent and identically distributed. DQN achieves good results on video games. However, as a temporal-difference method, the DQN algorithm is not efficient enough. This thesis attempts to improve the DQN algorithm from two perspectives: 1) accelerating the update process by searching for upper and lower bounds on the Q value of a state-action pair, and 2) adjusting the sampling probabilities in the experience replay pool.

The theoretical basis of the deep Q network algorithm (DQN) is the Bellman equation, but because DQN is model-free, it cannot plan a policy in a bottom-up manner with dynamic programming. Instead, it is a temporal-difference method in which the agent continuously interacts with the environment and repeats the update process until convergence, which is very time-consuming. Therefore, this thesis proposes a novel method that computes an upper and a lower bound on the Q value of a state-action pair and uses them to constrain the evaluation of its target Q value, improving the performance of DQN. On the other hand, in the prioritized experience replay method, the sampling probabilities of the states drawn from the experience replay pool are updated according to their TD errors, but the states preceding them are left unchanged. This thesis argues that the new Q values should be propagated to the preceding states, so the sampling probabilities of those states should also be increased. This improves the efficiency of prioritized experience replay. Simulation results show the effectiveness of our algorithms.
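
As an illustration only (the abstract gives no pseudocode), a minimal Python sketch of the first idea might look as follows. The function name and the way the bounds q_lower and q_upper are obtained are assumptions, not the thesis's specification; the sketch only shows how such bounds could constrain a DQN-style target.

import numpy as np

def clipped_td_target(reward, next_q_values, gamma, q_lower, q_upper):
    # Standard DQN target: r + gamma * max_a' Q(s', a').
    target = reward + gamma * float(np.max(next_q_values))
    # Constrain the target to the assumed bounds on the true Q value
    # of the current state-action pair (how these bounds are derived
    # is not specified here and is treated as given).
    return float(np.clip(target, q_lower, q_upper))

# Example: clipped_td_target(1.0, np.array([0.2, 0.5]), 0.99, 0.0, 10.0)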
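
For the second idea, the following is a minimal sketch of a proportional prioritized replay buffer that, when a transition's priority is updated from its TD error, also raises the priority of the immediately preceding transition. The class name, the propagate factor, and the exact propagation rule are assumptions introduced only to illustrate the mechanism described above.

import numpy as np

class PropagatingReplayBuffer:
    # Proportional prioritized replay that also boosts the priority of the
    # transition stored just before an updated one, so that new Q values
    # reach earlier states sooner.
    def __init__(self, capacity, alpha=0.6, propagate=0.5):
        self.capacity = capacity
        self.alpha = alpha          # priority exponent
        self.propagate = propagate  # assumed propagation factor
        self.data = []              # transitions, stored in time order
        self.priorities = []        # one priority per transition

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, indices, td_errors):
        for i, err in zip(indices, td_errors):
            new_p = abs(err) + 1e-6
            self.priorities[i] = new_p
            # Propagate part of the new priority to the previous
            # transition, whose target Q value is likely to change too.
            if i > 0:
                self.priorities[i - 1] = max(self.priorities[i - 1],
                                             self.propagate * new_p)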
Keywords/Search Tags: reinforcement learning, neural network, Q-learning, optimization