
Inverse Reinforcement Learning Algorithms In Semi-Markov Environments

Posted on: 2019-10-12    Degree: Master    Type: Thesis
Country: China    Candidate: C F Tan    Full Text: PDF
GTID: 2428330566998882    Subject: Control Science and Engineering
Abstract/Summary:
Reinforcement learning plays an important role in the field of intelligent decision making. As a key element of reinforcement learning, the reward function is usually set by experience, which not only lacks theoretical support but also has serious limitations in complex situations. As the inverse problem of reinforcement learning, inverse reinforcement learning recovers the optimal reward function by learning from expert policies or demonstration trajectories, thus providing a principled, automated construction method and theoretical support for the reward function. Existing inverse reinforcement learning algorithms mainly model the dynamic environment as a Markov decision process, ignoring the important factor of time. The semi-Markov decision process (SMDP) provides an effective model for describing time factors, but research on inverse reinforcement learning under this environment model is still in its infancy. Therefore, we study inverse reinforcement learning algorithms for semi-Markov environments.

In this dissertation, we first combine the idea of sensitivity-based optimization and derive sufficient optimality conditions on the reward in SMDPs by analyzing the special structure of the performance-difference formula between any two policies. We then obtain a performance-sensitivity-based inverse reinforcement learning algorithm for SMDPs, which directly reconstructs reward values via convex optimization and is mainly suited to problems with small state spaces. For inverse reinforcement learning problems with large state spaces, a method of approximating the rate function of SMDPs by a linear combination of feature basis functions is introduced. Three inverse reinforcement learning algorithms are then studied in different spaces, all of which convert the problem of constructing the reward function into one of adjusting the weights of the feature basis. In the value-function space, we give an apprenticeship inverse reinforcement learning algorithm for SMDPs: the feature expectation of a policy is introduced to represent the average reward, so the reward function is reconstructed by indirect matching of performance. In the policy space, we combine a loss function with the natural gradient of SMDPs and give an inverse reinforcement learning algorithm based on the natural policy gradient; the hidden reward function is constructed incrementally by direct matching in policy. In the probability space, we give a maximum-entropy inverse reinforcement learning algorithm for SMDPs based on a probabilistic model; using the likelihood function and maximum-entropy theory, the reward function is reconstructed by learning from collected samples of demonstration trajectories.

Finally, we verify the convergence and effectiveness of these algorithms on two simulation platforms, a grid maze with hallways and an unmanned-vehicle system, thus providing effective methods in different spaces for studying the reward function in SMDPs and expanding the application scope of inverse reinforcement learning theory. The research is of significance for further study and applications of inverse reinforcement learning.
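To make the maximum-entropy algorithm in the probability space more concrete, the sketch below illustrates the general idea of linear-reward maximum-entropy IRL adapted to an SMDP-style setting, where the discount is accumulated over sojourn times rather than step counts. This is a minimal illustrative sketch, not the dissertation's actual implementation: the tabular feature matrix `phi`, transition tensor `P`, per-state expected sojourn times `tau`, the `soft_visitation` helper, and the demonstration format of `(state, sojourn_time)` pairs are all assumptions introduced here for illustration.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """Empirical discounted feature expectations of the demonstrations.
    Each trajectory is a list of (state, sojourn_time) pairs; the discount is
    accumulated over sojourn time, which is the SMDP-style twist (assumed here)."""
    mu = np.zeros(phi.shape[1])
    for traj in trajectories:
        elapsed = 0.0
        for state, tau_s in traj:
            mu += (gamma ** elapsed) * phi[state]
            elapsed += tau_s
    return mu / len(trajectories)

def soft_visitation(P, r, tau, gamma=0.95, iters=100):
    """Hypothetical surrogate for expected state-visitation frequencies under the
    soft-optimal policy of reward r, with per-state expected sojourn times tau."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):                        # soft value iteration
        Q = r[:, None] + (gamma ** tau)[:, None] * np.einsum('sat,t->sa', P, V)
        V = np.log(np.exp(Q).sum(axis=1))
    policy = np.exp(Q - V[:, None])               # soft-max (max-entropy) policy
    d = np.ones(n_states) / n_states              # uniform start distribution
    visits = np.zeros(n_states)
    for _ in range(iters):                        # discounted forward pass
        visits += d
        d = gamma * np.einsum('s,sa,sat->t', d, policy, P)
    return visits / visits.sum()

def maxent_irl_smdp(phi, P, tau, trajectories, lr=0.05, iters=200):
    """Gradient ascent on the max-entropy log-likelihood: its gradient is the
    empirical feature expectation minus the model's expected feature counts."""
    w = np.zeros(phi.shape[1])
    mu_demo = feature_expectations(trajectories, phi)
    for _ in range(iters):
        r = phi @ w                               # current linear reward estimate
        d = soft_visitation(P, r, tau)
        w += lr * (mu_demo - phi.T @ d)
    return phi @ w, w                             # recovered reward and basis weights
```

The gradient step `mu_demo - phi.T @ d` reflects the standard maximum-entropy identity that the log-likelihood gradient equals the empirical feature expectations minus the model's expected feature counts; the weight vector `w` plays the role of the feature-basis weights that all three large-state-space algorithms adjust in place of the reward values themselves.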
Keywords/Search Tags:inverse reinforcement learning, performance sensitivity, feature basis, maximum entropy, semi-Markov decision process