Research On Learning Of The Optimal Policy In Large-Scale State Space

Posted on: 2018-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Zhong
Full Text: PDF
GTID: 1318330542465266
Subject: Computer Science and Technology
Abstract/Summary:
Reinforcement learning is an important class of machine learning methods that obtains an optimal policy by maximizing the expected cumulative return through interaction with the environment. According to whether knowledge of the model is given, reinforcement learning methods can be divided into model-free and model-based methods. Compared with model-free methods, model-based methods can find the optimal policy more quickly and with higher sample efficiency. However, the model is unknown in most practical problems, so it must be learned before planning. Recently, methods based on model learning have become a hot topic in reinforcement learning. Starting from the viewpoint of model learning, this dissertation studies the problems of low sample efficiency and slow convergence in large-scale state spaces and proposes a series of solutions. The main contributions are as follows:

(1) For problems with a continuous state space, a heuristic Dyna algorithm based on approximate model representation, HDyna-AMR, is proposed to find the optimal policy. HDyna-AMR uses linear functions to represent the feature transition matrix and the reward vector; the samples obtained by interacting with the environment are used to learn both, and the frequency with which each feature appears is recorded. During planning, the feature frequency serves as a priority in a priority queue, so that the most important features are sampled first and planning efficiency improves (a sketch of this scheme appears after the abstract). The convergence of HDyna-AMR is also analyzed.

(2) To handle problems with continuous state and action spaces, an actor-critic algorithm based on hierarchical model learning and planning, AC-HMLP, is proposed. AC-HMLP maintains two models: a local model approximated by local linear regression (LLR, sketched below) and a global model represented by linear function approximation. The samples collected while interacting with the environment are used to learn the models and to update the value function and the policy. Local planning starts only if the local model error does not exceed an error threshold, whereas global planning is launched after every episode. The two models cooperate to exploit both local and global information in the samples, improving model accuracy and convergence.

(3) To further improve sample efficiency, an approximate reinforcement learning method based on LSTD(λ) and policy approximation, Dyna-LSTD-PA, is proposed. Like the previous algorithms, Dyna-LSTD-PA consists of two concurrent processes: the learning process chooses actions according to a Gaussian distribution and uses LSTD(λ) to represent the value function, the policy, and the model, while the planning process uses offline LSTD(λ) to update the value-function parameters. Dyna-LSTD-PA applies the Sherman-Morrison formula to improve computational efficiency (see the recursive LSTD sketch below), and the parameters produced by the learning and planning processes are combined by a weighted sum to give the final value-function parameters. A global error bound for Dyna-LSTD-PA is derived.

(4) A regularized natural actor-critic algorithm with model learning and experience replay, RNAC-ML-ER, is proposed to reduce the variance of the policy gradient and improve the convergence rate. RNAC-ML-ER not only uses the online samples to learn the model but also stores them in an experience replay memory (illustrated below); at each time step, samples from the memory are replayed to speed up learning of the value function and the policy. To accelerate policy convergence, the natural gradient replaces the traditional policy gradient, and the advantage function serves as the objective for computing the policy gradient. Under two given assumptions, the convergence of the algorithm is verified.
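To make the planning scheme of contribution (1) concrete, the following Python sketch learns a feature transition matrix F and a reward vector b from real transitions, tracks how often each feature occurs, and replays the most frequent features first when planning. The class name, the gradient-style model updates, and the per-feature backup rule are illustrative assumptions, not the dissertation's exact algorithm.

```python
import heapq
import numpy as np

class LinearDynaPlanner:
    """Hypothetical Dyna-style planner with a learned linear model."""

    def __init__(self, n_features, gamma=0.95, alpha=0.1):
        self.F = np.zeros((n_features, n_features))  # feature transition matrix
        self.b = np.zeros(n_features)                # reward vector
        self.theta = np.zeros(n_features)            # value-function weights
        self.freq = np.zeros(n_features)             # feature occurrence counts
        self.gamma, self.alpha = gamma, alpha

    def learn_model(self, phi, r, phi_next):
        # Gradient step toward E[phi' | phi] ~ F phi and E[r | phi] ~ b . phi
        err_F = phi_next - self.F @ phi
        self.F += self.alpha * np.outer(err_F, phi)
        self.b += self.alpha * (r - self.b @ phi) * phi
        self.freq += np.abs(phi)                     # record feature activity

    def plan(self, n_steps):
        # Prioritise the features that occur most often in real experience.
        queue = [(-f, i) for i, f in enumerate(self.freq) if f > 0]
        heapq.heapify(queue)
        for _ in range(min(n_steps, len(queue))):
            _, i = heapq.heappop(queue)
            e_i = np.eye(len(self.theta))[i]         # unit vector for feature i
            # One-step model-based backup for this feature.
            target = self.b @ e_i + self.gamma * self.theta @ (self.F @ e_i)
            self.theta[i] += self.alpha * (target - self.theta[i])
```

A real agent would interleave a `learn_model` call on every environment step with a handful of `plan` steps, which is the usual Dyna loop.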
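The local model of contribution (2) is based on local linear regression. Below is a minimal sketch, assuming a k-nearest-neighbour fit with an affine design matrix; k, the Euclidean metric, and the least-squares fit are assumptions of this illustration.

```python
import numpy as np

class LocalLinearModel:
    """Illustrative LLR model: predict via a least-squares fit
    over the k nearest stored samples."""

    def __init__(self, k=10):
        self.X, self.Y = [], []   # stored inputs and next-state targets
        self.k = k

    def add(self, x, y):
        self.X.append(x)
        self.Y.append(y)

    def predict(self, x):
        X, Y = np.array(self.X), np.array(self.Y)
        # Pick the k nearest neighbours of the query point.
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[: self.k]
        # Affine design matrix: append a bias column of ones.
        Xk = np.hstack([X[idx], np.ones((len(idx), 1))])
        beta, *_ = np.linalg.lstsq(Xk, Y[idx], rcond=None)
        return np.append(x, 1.0) @ beta
```

AC-HMLP trusts such a model only while its prediction error stays below a threshold; that is the condition under which local planning is triggered.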
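Contribution (3) relies on the Sherman-Morrison identity to avoid re-inverting the LSTD matrix A after every sample. The sketch below shows the standard recursive LSTD(0) form; the eligibility traces that LSTD(λ) adds are omitted for brevity, and epsilon (the initial scaling of A) is an assumption.

```python
import numpy as np

class RecursiveLSTD:
    """Maintain A^{-1} incrementally so each update costs O(n^2)
    instead of the O(n^3) of a fresh matrix inversion."""

    def __init__(self, n_features, gamma=0.95, epsilon=1.0):
        self.A_inv = np.eye(n_features) / epsilon  # running inverse of A
        self.b = np.zeros(n_features)
        self.gamma = gamma

    def update(self, phi, r, phi_next):
        u = phi                                    # rank-one column vector
        v = phi - self.gamma * phi_next            # rank-one row vector
        # Sherman-Morrison:
        # (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)
        Au = self.A_inv @ u
        vA = v @ self.A_inv
        self.A_inv -= np.outer(Au, vA) / (1.0 + vA @ u)
        self.b += r * phi

    def weights(self):
        return self.A_inv @ self.b                 # theta = A^{-1} b
```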
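Finally, the experience replay memory of contribution (4) can be as simple as a bounded buffer of transitions that is re-sampled at every step. Capacity, batch size, and uniform sampling are assumptions of this sketch; the dissertation's replay scheme may differ.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Bounded transition buffer; oldest samples are evicted first."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Replay a uniform mini-batch of past transitions.
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        return map(np.array, zip(*batch))
```

Typical usage is `states, actions, rewards, next_states = memory.sample()`, with the mini-batch fed to the critic and actor updates at every time step.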
Keywords/Search Tags:reinforcement learning, model learning, function approximation, model planning, policy gradient, experience replay