
Deep Value Iteration Network For Partially Observable Markov Decision Process

Posted on: 2019-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: G H Miao
Full Text: PDF
GTID: 2518306473453774
Subject: Computer technology

Abstract/Summary:
In recent years, deep reinforcement learning (DRL) has become a new research hotspot in artificial intelligence. DRL has been applied successfully in many fields, such as game playing, machine translation, text generation, and target tracking. However, most DRL models address tasks whose environment states are fully observable, while research on DRL in partially observable environments remains scarce. Moreover, existing models exploit only the generalization ability of deep neural networks to fit the value function or the policy function; such model-free methods can achieve good performance but often ignore the structural information of the task.

To make full use of this structural information, this thesis proposes a deep reinforcement learning algorithm based on value iteration, which effectively combines the advantages of model-free learning and model-based planning. Inspired by ADRQN, the model uses two independent recurrent neural networks: one computes the belief over states and the other iterates the Q-values, and their outputs are integrated to choose the action. Analogous to the belief computation, the latent state of the recurrent network that iterates the Q-values is treated as an embedding representation of the Q-values. By backpropagating gradients, the model learns the structural information of the underlying Markov decision process and thereby carries out the iteration of the Q-values.

Secondly, this thesis analyzes the suboptimality of the coupled action-observation input form in the ADRQN model. To avoid the suboptimality caused by the difference between the input forms of observations and actions, this thesis makes a simple modification to the model's input structure and applies it to standard POMDP tasks. Compared with ADRQN, the proposed model achieves better performance on multiple navigation tasks, verifying the effectiveness of the improvements proposed in this thesis.
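For readers unfamiliar with the planning component that the recurrent network is trained to emulate, the following minimal sketch shows classical tabular Q-value iteration (the Bellman optimality backup) on a toy fully observable MDP. The MDP, transition table, and all names here are illustrative and are not taken from the thesis; the thesis's model performs analogous iterations implicitly inside a learned recurrent module over belief states.

```python
# Illustrative sketch: tabular Q-value iteration on a toy two-state MDP.
# All structures below are made up for demonstration; the thesis's network
# learns to approximate this kind of Bellman backup, it does not run it.

def q_value_iteration(P, R, gamma=0.9, n_iters=100):
    """P[s][a] is a list of (prob, next_state); R[s][a] is a reward."""
    n_states = len(P)
    n_actions = len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        new_Q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup: r(s,a) + gamma * E[max_a' Q(s',a')]
                new_Q[s][a] = R[s][a] + gamma * sum(
                    p * max(Q[s2]) for p, s2 in P[s][a]
                )
        Q = new_Q
    return Q

# Toy chain: from state 0, action 0 stays and action 1 moves to state 1;
# state 1 is absorbing and pays reward 1 for either action.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # transitions from state 0
    [[(1.0, 1)], [(1.0, 1)]],  # transitions from state 1
]
R = [[0.0, 0.0], [1.0, 1.0]]

Q = q_value_iteration(P, R)
# Greedy policy from the converged Q-values: in state 0, move to state 1.
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

With discount 0.9 the fixed point is V(1) = 1/(1-0.9) = 10 and Q(0, move) = 0.9 * 10 = 9, so the greedy policy in state 0 is to move toward the rewarding state. In the thesis's setting the true state is hidden, so the backup operates over beliefs rather than states, which is what the learned recurrent iteration approximates.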
Keywords/Search Tags: Reinforcement Learning, Deep Learning, Partially Observable Markov Decision Process, Value Iteration