
Research On Machine Learning Algorithms Based On Planning Network Model

Posted on: 2021-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Chen
Full Text: PDF
GTID: 2428330605974874
Subject: Software engineering
Abstract/Summary:
With the rise of deep learning in artificial intelligence, neural network models have been widely applied to reinforcement learning, imitation learning, and meta-learning problems. In these settings, a policy is usually represented by a neural network trained with a reinforcement learning or imitation learning algorithm, or within a meta-learning framework. However, because such networks contain no explicit planning operations, the resulting policies remain reactive in nature. To address this problem, this thesis proposes several machine learning algorithms based on planning network models and studies them in the fully observable Markov decision process (MDP), the partially observable Markov decision process (POMDP), and the meta-learning setting. The main research content can be summarized in three parts:

i. The generalized value iteration network (GVIN) is a planning network model for MDPs. The value iteration process inside GVIN does not allocate planning time to each state according to the state's importance, which degrades both the planning performance and the generalization ability of the network. Using a state-based asynchronous update method, this thesis therefore proposes the generalized asynchronous value iteration network, in which the planning time spent on each state is allocated appropriately (a prioritized-update sketch is given after this abstract). Second, GVIN is trained with episodic Q-learning, which suffers from the same value overestimation as standard Q-learning. By combining the weighted double estimator with episodic Q-learning, episodic weighted double Q-learning is proposed to minimize the effect of overestimation on training performance (see the update-rule sketch below). Finally, a new graph convolutional operator is proposed that weakens the influence of the node degree distribution of the task's graph structure on the planning results, further improving the planning performance of the network.

ii. QMDP-net is a planning network for POMDPs. It solves the POMDP with the QMDP algorithm, whose internal mechanism is value iteration, so the planning process of QMDP-net suffers from problems similar to those of GVIN. Using the same asynchronous-update idea, an asynchronous update method for partially observable environments is embedded in the QMDP-net planning module, and asynchronous QMDP-net is proposed. In addition, QMDP assumes that the uncertainty in the agent's current belief state disappears once the next action has been executed (the approximation is sketched after this abstract), so the policy produced by the planning network cannot be applied to task domains that require information to be gathered repeatedly, which degrades planning performance. Replicated Q-learning is therefore used to partially replace QMDP, yielding a recurrent policy network that plans better in partially observable environments.

iii. MAML is a meta-learning framework for planning. Exploiting the characteristics of meta-reinforcement learning, the network's parameters are continually trained by gradient descent on the policies and trajectories the agent has executed before, so that the network adapts quickly to new tasks and plans effective policies. Because the meta-update of MAML must differentiate through the inner gradient step and thus estimate second derivatives (illustrated in closed form after this abstract), the training stability and the generalization of the algorithm are reduced to a certain extent. A new meta-learning framework is therefore proposed that performs the meta-optimization process better, so that the resulting policy has better generalization ability.
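The abstract does not specify the state-importance criterion behind the generalized asynchronous value iteration network, so the following is only a minimal tabular sketch of the general idea, using Bellman error as a stand-in priority; the function name and the (P, R) tensor layout are illustrative assumptions, not the thesis's interface.

```python
import heapq
import numpy as np

def async_value_iteration(P, R, gamma=0.99, n_updates=10_000, theta=1e-6):
    """Asynchronous VI sketch: states with larger Bellman error are
    backed up first, so planning effort concentrates where it matters.
    P: transitions, shape (S, A, S); R: rewards, shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)

    def backup(s):
        return max(R[s, a] + gamma * P[s, a] @ V for a in range(A))

    # Seed the priority queue with every state's initial Bellman error.
    heap = [(-abs(backup(s) - V[s]), s) for s in range(S)]
    heapq.heapify(heap)

    for _ in range(n_updates):
        if not heap:
            break
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:        # largest remaining error is negligible
            break
        V[s] = backup(s)
        # Re-score every state; stale heap entries are harmless because
        # backup() recomputes the target when they are popped.
        for s2 in range(S):
            err = abs(backup(s2) - V[s2])
            if err >= theta:
                heapq.heappush(heap, (-err, s2))
    return V
```

Synchronous value iteration, by contrast, sweeps all states uniformly; the prioritized schedule above is one simple way to spend more planning time on the states that still change the value function.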
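Likewise, the exact episodic variant of the weighted double estimator is not given here; the sketch below is a one-step tabular form of weighted double Q-learning, which the episodic algorithm presumably applies along stored episodes. The mixing constant c and the function name are assumptions.

```python
import random
import numpy as np

def weighted_double_q_update(QA, QB, s, a, r, s2,
                             alpha=0.1, gamma=0.99, c=1.0):
    """One weighted double Q-learning step (tabular sketch).
    beta interpolates between the single estimator (prone to
    overestimation) and the double estimator (prone to underestimation)."""
    # Randomly pick which table selects the greedy action this step,
    # as in double Q-learning.
    if random.random() < 0.5:
        QA, QB = QB, QA
    a_star = int(np.argmax(QA[s2]))   # greedy next action under QA
    a_low = int(np.argmin(QA[s2]))    # lowest-valued next action under QA
    spread = abs(QB[s2, a_star] - QB[s2, a_low])
    beta = spread / (c + spread)      # adaptive weight in [0, 1)
    target = r + gamma * (beta * QA[s2, a_star]
                          + (1.0 - beta) * QB[s2, a_star])
    QA[s, a] += alpha * (target - QA[s, a])
```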
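The QMDP approximation that motivates part ii is standard and can be stated in a few lines; the tensor layouts T[s, a, s'] and O[s', a, o] below are assumptions for illustration.

```python
import numpy as np

def qmdp_action(belief, Q):
    """QMDP action selection: argmax_a sum_s b(s) Q(s, a), where Q holds
    the fully observable MDP Q-values. All state uncertainty is assumed
    to vanish after the next step. belief: (S,), Q: (S, A)."""
    return int(np.argmax(belief @ Q))

def belief_update(belief, a, o, T, O):
    """Bayes filter: b'(s') is proportional to
    O(s', a, o) * sum_s T(s, a, s') b(s)."""
    b2 = O[:, a, o] * (belief @ T[:, a, :])
    return b2 / b2.sum()
```

Because qmdp_action never values a reduction in future uncertainty, purely information-gathering actions are never preferred; this is exactly the limitation that the replicated Q-learning replacement targets.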
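For part iii, the second-derivative term in MAML's meta-update can be written in closed form when the inner loss is least squares; the toy below illustrates that term and the common first-order approximation that drops it. It is a didactic sketch, not the thesis's new framework.

```python
import numpy as np

def maml_meta_grad(w, A_tr, y_tr, A_val, y_val,
                   alpha=0.01, first_order=False):
    """Meta-gradient for one task with inner loss 0.5 * ||A w - y||^2.
    Inner adaptation:  w' = w - alpha * grad L_tr(w)
    Meta-gradient:     dL_val(w')/dw = (I - alpha * H_tr) @ grad L_val(w')
    The (I - alpha * H_tr) factor is the second-derivative term MAML
    needs; the first-order variant (FOMAML) simply drops it."""
    grad_tr = A_tr.T @ (A_tr @ w - y_tr)      # inner-loss gradient at w
    w_adapted = w - alpha * grad_tr           # one inner gradient step
    grad_val = A_val.T @ (A_val @ w_adapted - y_val)
    if first_order:
        return grad_val                       # FOMAML: ignore curvature
    H_tr = A_tr.T @ A_tr                      # inner-loss Hessian (constant here)
    return (np.eye(len(w)) - alpha * H_tr) @ grad_val
```

Differentiating through the inner step is what makes full MAML costly and sometimes unstable, which is the behaviour the proposed framework aims to improve.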
Keywords/Search Tags:deep reinforcement learning, imitation learning, planning, asynchronous update, meta-reinforcement learning