
Regularized Maximum Entropy Imitation Learning Based on a Trajectory Reward Prior

Posted on: 2022-03-27  Degree: Master  Type: Thesis
Country: China  Candidate: G Lu  Full Text: PDF
GTID: 2518306464966419  Subject: Software Engineering
Abstract/Summary:
Unlike traditional reinforcement learning, in imitation learning the agent's task is to learn from expert demonstration data. Broadly, imitation learning includes behavior cloning, which directly imitates the expert's actions, and apprenticeship learning, which first learns a reward function and then learns a policy from it. Building on these two ideas and on recent imitation learning research, and following the regularization approach to solving ill-posed problems, this thesis proposes a regularized maximum entropy imitation learning algorithm based on a trajectory reward prior, so that the agent acquires good robustness during learning.

First, the algorithm borrows the maximum entropy policy learning framework from stochastic reinforcement learning and applies it to imitation learning, so that the policy finally learned by the agent retains as many ways as possible of reaching the given goal. Next, the algorithm introduces constraints on the reward function: the sparsity of the reward prior alleviates overfitting in policy learning, and the dynamic transition probabilities of the environment are taken into account so that the agent can learn multi-step planning strategies. Further, the algorithm introduces a regularized residual: using the prior estimate of each trajectory's reward, the regularization strength is differentiated so as to encourage the agent to take the actions that correspond to expert behavior in the corresponding environment states (a schematic form of this objective is sketched after this abstract). Finally, the algorithm treats the trajectory data as positive and unlabeled data and uses a variational inference method over the positive-sample distribution to train an optimal Bayesian classifier on trajectories, so that the reward prior is updated dynamically as imitation learning proceeds (an illustrative positive-unlabeled classifier sketch also follows).

Based on the MuJoCo physics engine of the OpenAI Gym platform, this thesis selects seven robot control tasks for deploying the imitation learning experiments (an illustrative environment setup is sketched below) and compares the proposed algorithm with four representative baseline algorithms. The proposed algorithm is significantly better than the other imitation learning algorithms in the comparison. The experimental results also show that the maximum entropy policy component lets the agent's policy improve gradually with training, and that the Bayesian classifier over trajectory data relaxes the constraints on the well-performing behaviors of the agent's current policy.
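The abstract describes but does not write out the learning objective. As a minimal sketch, assuming a learned reward r_theta, an entropy weight alpha, an l1 sparsity penalty standing in for the reward prior, and a per-trajectory regularization weight lambda driven by the prior reward estimate R_hat(tau) (all of these symbols are assumptions, not notation taken from the thesis), the regularized maximum entropy objective could take the form

    J(\pi, r_\theta) = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_t r_\theta(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \;-\; \lambda\big(\hat{R}(\tau)\big)\, \lVert r_\theta \rVert_1 ,

where the entropy term keeps multiple ways of reaching the goal available, the l1 term encodes the sparsity of the reward prior, and lambda shrinks for trajectories whose prior reward estimate is high, so that expert-like behavior is penalized less.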
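The abstract only names a "positive sample distribution variational inference" method for training the trajectory classifier. As an illustrative stand-in under that reading, the sketch below uses a standard non-negative positive-unlabeled (PU) risk; the network shape, the class-prior value, and the substitution of the nnPU risk for the thesis's variational estimator are all assumptions. Expert demonstrations play the role of the positive set, the agent's own rollouts the unlabeled set, and the trained classifier's score for a new trajectory can then serve as the dynamic reward prior R_hat(tau).

import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    # Scores a trajectory feature vector; sigmoid(logit) approximates P(expert | trajectory).
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

def pu_risk(clf, x_pos, x_unl, prior=0.5):
    # Non-negative PU risk: positives are expert trajectories, unlabeled are agent rollouts.
    bce = nn.BCEWithLogitsLoss()
    logits_pos = clf(x_pos)
    logits_unl = clf(x_unl)
    risk_pos = bce(logits_pos, torch.ones_like(logits_pos))          # positives labeled positive
    risk_pos_as_neg = bce(logits_pos, torch.zeros_like(logits_pos))  # positives labeled negative
    risk_unl_as_neg = bce(logits_unl, torch.zeros_like(logits_unl))  # unlabeled labeled negative
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg             # unbiased negative-class risk
    return prior * risk_pos + torch.clamp(neg_risk, min=0.0)         # clamp keeps the risk non-negative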
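The exact seven MuJoCo control tasks are not listed in this abstract, so the environment names below are illustrative only; the snippet assumes the classic Gym MuJoCo environments (mujoco-py backend installed) and the pre-0.26 gym API in which reset() returns only the observation.

import gym

for env_id in ["Hopper-v2", "Walker2d-v2", "HalfCheetah-v2", "Ant-v2"]:
    env = gym.make(env_id)   # requires the MuJoCo backend
    obs = env.reset()
    print(env_id, env.observation_space.shape, env.action_space.shape)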
Keywords/Search Tags: Imitation Learning, Maximum Entropy Policy, Positive and Unlabeled Learning, Variational Inference, Inverse Problems and Regularization