
Model-based Off-policy Optimization

Posted on: 2022-10-19 | Degree: Master | Type: Thesis
Country: China | Candidate: G J Ge | Full Text: PDF
GTID: 2518306533472814 | Subject: Control Engineering
Abstract/Summary:
In recent years, with the development of deep reinforcement learning, model-free deep reinforcement learning has been applied successfully in games, robot control, and other fields. In engineering applications, however, an agent trained with model-free deep reinforcement learning must interact with the environment continuously until it learns the optimal policy. This interaction is complicated by the exploration-exploitation trade-off, which greatly reduces the sample efficiency of reinforcement learning algorithms, so the agent often learns only a suboptimal policy. Model-based reinforcement learning offers a remedy for the low sample efficiency of model-free reinforcement learning. This thesis focuses on the sample efficiency of deep reinforcement learning algorithms and studies model-based deep reinforcement learning. The main contributions are as follows:

(1) To address the low sample efficiency of model-free deep reinforcement learning, off-policy optimization based on a latent dynamics model (LDM-OFP) is proposed. First, a latent dynamics model is constructed in a latent variable space, and its parameters are trained by maximizing the likelihood of the agent's environment interaction data. Next, a large amount of trajectory data is generated using an exploration policy that maximizes the temporal difference error and an execution policy that maximizes the cumulative reward, and importance sampling is used to reduce the influence of off-policy data on value function training. Finally, multi-step bootstrapping is combined with an exponentially weighted average to train the value function, the exploration policy, and the execution policy, so that each model overcomes, to some extent, the cumulative bias caused by long-horizon prediction.

(2) To address the poor generality of the learned model in model-based reinforcement learning, off-policy optimization based on a meta-model is proposed. Unlike the latent dynamics model, which serves a single task, the meta-model aims to find a set of sensitive model parameters that adapt quickly to multiple tasks. First, building on the latent dynamics model of the first contribution, a meta-model based on two-step gradient optimization is proposed, and its parameters are trained on data from the agent's interaction with the training tasks. Next, the meta-model parameters are fine-tuned on data from the agent's interaction with a new task, yielding a latent dynamics model that adapts quickly to that task. Finally, trajectory data is generated with this latent dynamics model, and the agent's policy is trained with the LDM-OFP algorithm from the first contribution.

This thesis evaluates the proposed algorithms on the Control Suite platform, which is built on the MuJoCo physics engine, and compares them with other state-of-the-art algorithms. The experimental results show that the two model-based deep reinforcement learning algorithms match the performance of other state-of-the-art algorithms with only a small number of agent-environment interaction steps, and their sample efficiency is significantly better than that of model-free deep reinforcement learning algorithms. There are 25 figures, 13 tables, and 87 references in this thesis.
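The value targets in contribution (1) — multi-step bootstrap targets blended by an exponentially weighted average — can be sketched as a lambda-return computed over a model-generated trajectory. The function below is an illustrative assumption, not the thesis's exact implementation: the names are hypothetical, and in LDM-OFP `rewards` and `values` would come from rollouts of the latent dynamics model rather than be passed in as lists.

```python
import numpy as np

def multistep_lambda_target(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted average of multi-step bootstrap targets
    (lambda-return style; illustrative sketch, not the thesis's code).

    rewards: r_0 .. r_{H-1} along a length-H model rollout
    values:  V(s_0) .. V(s_H), one bootstrap value per visited state
    """
    H = len(rewards)
    # n-step bootstrap targets G_n = sum_{k<n} gamma^k r_k + gamma^n V(s_n)
    targets = []
    g = 0.0
    for n in range(1, H + 1):
        g += (gamma ** (n - 1)) * rewards[n - 1]
        targets.append(g + (gamma ** n) * values[n])
    # weights (1 - lam) * lam^(n-1); the final term absorbs the tail
    # weight lam^(H-1) so the weights sum to exactly 1
    weights = np.array([(1 - lam) * lam ** (n - 1) for n in range(1, H)]
                       + [lam ** (H - 1)])
    return float(np.dot(weights, targets))
```

With `lam=0` this reduces to the one-step temporal-difference target, and with `lam=1` to the full H-step return, which is how the blend trades off bootstrap bias against the model's long-horizon prediction error.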
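The two-step gradient optimization behind the meta-model in contribution (2) follows the pattern popularized by MAML: an inner gradient step adapts the shared parameters to each training task, and an outer step updates the shared initialization by differentiating through that adaptation. The sketch below is an assumption made for illustration only: it uses a scalar parameter and toy quadratic task losses L_c(theta) = 0.5 * (theta - c)^2 as a stand-in for the latent-model likelihood, and all names are hypothetical.

```python
def maml_meta_step(theta, tasks, alpha=0.1, beta=0.1):
    """One meta-update of a scalar parameter over toy quadratic task
    losses L_c(theta) = 0.5 * (theta - c)**2 (illustrative sketch of
    two-level gradient optimization, not the thesis's model).

    alpha: inner (per-task adaptation) step size
    beta:  outer (meta) step size
    """
    meta_grad = 0.0
    for c in tasks:
        # inner step: one gradient step of adaptation on task c,
        # since dL_c/dtheta = theta - c
        theta_prime = theta - alpha * (theta - c)
        # outer gradient: d L_c(theta_prime) / d theta via the chain
        # rule, differentiating through the inner update
        meta_grad += (theta_prime - c) * (1.0 - alpha)
    # outer step: move the shared initialization
    return theta - beta * meta_grad / len(tasks)
```

Adapting to a new task, as in the fine-tuning step of the abstract, then amounts to repeating only the inner update from the learned initialization on the new task's interaction data.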
Keywords/Search Tags:deep reinforcement learning, latent dynamic model, temporal difference error, multi-step bootstrap, meta model