Inverse Reinforcement Learning Under Average Reward Criterion

Posted on: 2014-11-15  Degree: Master  Type: Thesis
Country: China  Candidate: Z R Tao  Full Text: PDF
GTID: 2298330422990620  Subject: Control Engineering
Abstract/Summary:
In reinforcement learning, the reward function is usually set by hand based on experience, so its optimality is hard to guarantee. Apprenticeship learning likewise requires a reward function to be specified. Inverse reinforcement learning can recover the underlying reward function by learning from demonstration trajectories or an expert's policy, providing an effective way to construct the reward function while avoiding the subjectivity of hand-crafted design. It is therefore worthwhile to study inverse reinforcement learning.

To date, inverse reinforcement learning has focused mainly on Markov Decision Processes (MDPs) under the discounted criterion; inverse reinforcement learning under the average reward criterion has received almost no attention. This thesis therefore studies inverse reinforcement learning under the average reward criterion to address the problem of constructing reward functions. The work consists of two parts. First, based on the sensitivity idea, we derive a sensitivity-based inverse reinforcement learning algorithm for small state spaces by analyzing the performance difference formula under the average reward criterion. Second, for large state spaces, or when the reward function cannot be described directly, we represent the reward function as a linear combination of feature basis functions. Combining this representation with the maximum margin idea, the zero-sum game idea, and the natural gradient idea yields three further inverse reinforcement learning algorithms under the average reward criterion: maximum margin, zero-sum game, and natural gradient inverse reinforcement learning.

All four algorithms are implemented both in a grid world and on an unmanned-vehicle simulation platform. Their effectiveness is evaluated in three respects: the number of states in which the computed policy's action differs from the expert's, the difference in average reward between the computed policy and the expert policy, and the value of the recovered reward function. In addition, we analyze how strongly each of the four algorithms depends on the expert policy and the environment, and compare the advantages and disadvantages of the algorithms.
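The reward representation described above can be sketched in a few lines. This is a minimal illustrative example, not the thesis's implementation: the feature basis functions `phi`, the weight vector `w`, and the 1-D state space are all assumptions chosen for brevity. It shows the two ingredients the abstract names: a reward modeled as a linear combination of feature basis functions, and performance measured as average (per-step) reward rather than a discounted sum.

```python
import numpy as np

def phi(state, n_features=4):
    """Feature basis functions: Gaussian bumps over a 1-D state in [0, 1].
    (Illustrative choice; any fixed basis would fit the same template.)"""
    centers = np.linspace(0.0, 1.0, n_features)
    return np.exp(-((state - centers) ** 2) / 0.1)

def reward(state, w):
    """Reward as a linear combination of the feature basis functions:
    r(s) = w . phi(s)."""
    return float(w @ phi(state))

def average_reward(trajectory, w):
    """Average-reward criterion: the mean per-step reward along a
    trajectory, with no discount factor."""
    return float(np.mean([reward(s, w) for s in trajectory]))

# Hypothetical weights and a short demonstration trajectory.
w = np.array([1.0, 0.5, -0.5, 0.0])
expert_traj = [0.1, 0.2, 0.4, 0.8]
print(average_reward(expert_traj, w))
```

Inverse reinforcement learning then searches for weights `w` under which the expert's trajectories attain higher average reward than alternative policies; the maximum margin, zero-sum game, and natural gradient variants differ in how that search is posed and solved.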
Keywords/Search Tags: inverse reinforcement learning, policy, MDP, feature basis function, average reward