
Regularized Maximum Entropy Imitation Learning Based on a Trajectory Reward Prior

Posted on: 2022-03-27  Degree: Master  Type: Thesis
Country: China  Candidate: G Lu  Full Text: PDF
GTID: 2518306464966419  Subject: Software Engineering
Abstract/Summary:
Unlike traditional reinforcement learning, in imitation learning the agent's task is to learn from expert demonstration data. Broadly, imitation learning includes behavior cloning, which directly imitates the expert's actions, and apprenticeship learning, which first learns a reward function and then learns a policy from it. Building on these two ideas and on recent imitation learning research, and following the regularization approach to solving ill-posed problems, this thesis proposes a regularized maximum entropy imitation learning algorithm based on a trajectory reward prior, so that the agent acquires good robustness during learning.

First, the algorithm borrows the maximum entropy policy learning framework from stochastic reinforcement learning and applies it to imitation learning, so that the policy finally learned by the agent retains as many ways as possible of reaching the given goal. Next, the algorithm introduces constraints on the reward function: the sparsity of the reward prior alleviates overfitting in policy learning, and the dynamic transition probabilities of the environment are taken into account so that the agent can learn multi-step planning strategies. Further, the algorithm introduces a regularized residual: using the prior estimate of each trajectory's reward, the regularization strength is differentiated so as to encourage the agent to take the actions that correspond to expert behavior in the corresponding environment states (a schematic form of this objective is sketched after this abstract). Finally, the algorithm treats the trajectory data as positive and unlabeled data and uses a variational inference method over the positive-sample distribution to train an optimal Bayesian classifier on trajectories, so that the reward prior is updated dynamically as imitation learning proceeds (an illustrative positive-unlabeled classifier sketch also follows).

Based on the MuJoCo physics engine of the OpenAI Gym platform, this thesis selects seven robot control tasks for deploying the imitation learning experiments (an illustrative environment setup is sketched below) and compares the proposed algorithm with four representative baseline algorithms. The proposed algorithm is significantly better than the other imitation learning algorithms in the comparison. The experimental results also show that the maximum entropy policy component lets the agent's policy improve gradually with training, and that the Bayesian classifier over trajectory data relaxes the constraints on the well-performing behaviors of the agent's current policy.
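The abstract describes but does not write out the learning objective. As a minimal sketch, assuming a learned reward r_theta, an entropy weight alpha, an l1 sparsity penalty standing in for the reward prior, and a per-trajectory regularization weight lambda driven by the prior reward estimate R_hat(tau) (all of these symbols are assumptions, not notation taken from the thesis), the regularized maximum entropy objective could take the form

    J(\pi, r_\theta) = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_t r_\theta(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \;-\; \lambda\big(\hat{R}(\tau)\big)\, \lVert r_\theta \rVert_1 ,

where the entropy term keeps multiple ways of reaching the goal available, the l1 term encodes the sparsity of the reward prior, and lambda shrinks for trajectories whose prior reward estimate is high, so that expert-like behavior is penalized less.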
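The abstract only names a "positive sample distribution variational inference" method for training the trajectory classifier. As an illustrative stand-in under that reading, the sketch below uses a standard non-negative positive-unlabeled (PU) risk; the network shape, the class-prior value, and the substitution of the nnPU risk for the thesis's variational estimator are all assumptions. Expert demonstrations play the role of the positive set, the agent's own rollouts the unlabeled set, and the trained classifier's score for a new trajectory can then serve as the dynamic reward prior R_hat(tau).

import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    # Scores a trajectory feature vector; sigmoid(logit) approximates P(expert | trajectory).
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

def pu_risk(clf, x_pos, x_unl, prior=0.5):
    # Non-negative PU risk: positives are expert trajectories, unlabeled are agent rollouts.
    bce = nn.BCEWithLogitsLoss()
    logits_pos = clf(x_pos)
    logits_unl = clf(x_unl)
    risk_pos = bce(logits_pos, torch.ones_like(logits_pos))          # positives labeled positive
    risk_pos_as_neg = bce(logits_pos, torch.zeros_like(logits_pos))  # positives labeled negative
    risk_unl_as_neg = bce(logits_unl, torch.zeros_like(logits_unl))  # unlabeled labeled negative
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg             # unbiased negative-class risk
    return prior * risk_pos + torch.clamp(neg_risk, min=0.0)         # clamp keeps the risk non-negative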
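The exact seven MuJoCo control tasks are not listed in this abstract, so the environment names below are illustrative only; the snippet assumes the classic Gym MuJoCo environments (mujoco-py backend installed) and the pre-0.26 gym API in which reset() returns only the observation.

import gym

for env_id in ["Hopper-v2", "Walker2d-v2", "HalfCheetah-v2", "Ant-v2"]:
    env = gym.make(env_id)   # requires the MuJoCo backend
    obs = env.reset()
    print(env_id, env.observation_space.shape, env.action_space.shape)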
Keywords/Search Tags: Imitation Learning, Maximum Entropy Policy, Positive and Unlabeled Learning, Variational Inference, Inverse Problems and Regularization