
Research On Policy Gradient Methods With Variance Related Risk Criteria

Posted on: 2017-05-16    Degree: Master    Type: Thesis
Country: China    Candidate: D Xu    Full Text: PDF
GTID: 2308330488461931    Subject: Software engineering
Abstract/Summary:
Reinforcement learning, an important branch of machine learning, learns by continuously interacting with the environment even when no labeled data or exact model of the environment is available. Most reinforcement learning algorithms aim to maximize the average reward or the long-term accumulated (discounted) reward. However, for risk-sensitive problems such as control, finance, and clinical decision making, managing risk is as important as maximizing this traditional objective. The risk considered in this thesis is the variance of the rewards, and the new objective is to maximize a variance-related risk criterion that combines the traditional objective with a variance term (a minimal illustrative sketch of such an objective appears after this abstract). This thesis focuses on risk-sensitive reinforcement learning: we combine the variance-related objective with policy gradient methods and propose several new policy gradient algorithms with variance-related risk criteria.

i. One advantage of the off-policy actor-critic algorithm is that it can use an exploratory behaviour policy that differs from the evaluation (target) policy it improves. However, using such a behaviour policy increases the variance, so the algorithm performs poorly on risk-sensitive problems. To reduce the variance of off-policy actor-critic, we take the variance-related risk criterion as the objective and propose a variance-related off-policy actor-critic algorithm named VOPAC. We prove its convergence and compare it with other algorithms on a continuous-state control problem to illustrate its properties.

ii. Temporal-difference and eligibility-trace methods benefit temporal credit assignment and are widely used in reinforcement learning. The true online TD(λ) algorithm obtains a backward view with eligibility traces that is exactly equivalent to its forward view. We propose a variance-related true online TD(λ) algorithm named VPGTD(λ) to manage risk and prove the equivalence of its forward and backward views. We also illustrate its variance-control properties against other algorithms on a continuous-state control problem.

iii. The two contributions above are based on Markov decision processes (MDPs). Partially observable Markov decision processes (POMDPs) are just as important in reinforcement learning. As the name implies, a POMDP observes the environment only partially, which introduces higher uncertainty and hence higher variance. To reduce this variance, we incorporate value functions into a policy gradient algorithm and propose ACIS. Furthermore, a variance-related version named VACIS is proposed to manage risk further. Experimental results show that both algorithms reduce the variance and achieve better performance.
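To make the variance-related risk criterion concrete, the sketch below optimizes a penalized objective J(θ) = E[R] − β·Var(R) with a likelihood-ratio (REINFORCE-style) policy gradient on a toy two-armed bandit. The bandit, the penalty weight β, and the single-step episodes are illustrative assumptions only; this is not the thesis's VOPAC, VPGTD(λ), or VACIS algorithm.

```python
# Illustrative sketch (not the thesis's algorithms): maximize
# J(theta) = E[R] - beta * Var(R) with a score-function gradient
# on an assumed 2-armed Gaussian bandit.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5                 # risk-aversion weight (assumed)
theta = np.zeros(2)        # softmax preferences over 2 actions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_reward(action):
    # Arm 0: higher mean but high variance; arm 1: lower mean, low variance.
    return rng.normal(1.0, 2.0) if action == 0 else rng.normal(0.8, 0.1)

for _ in range(2000):
    probs = softmax(theta)
    actions = rng.choice(2, size=64, p=probs)                 # batch of one-step episodes
    rewards = np.array([sample_reward(a) for a in actions])
    grads_logp = np.eye(2)[actions] - probs                   # grad of log softmax policy
    mean_R = rewards.mean()
    # grad E[R]   ~ E[R   * grad log pi]
    # grad Var(R) = grad E[R^2] - 2 E[R] grad E[R]
    g_mean = (rewards[:, None] * grads_logp).mean(axis=0)
    g_second = ((rewards ** 2)[:, None] * grads_logp).mean(axis=0)
    g_var = g_second - 2.0 * mean_R * g_mean
    theta += 0.05 * (g_mean - beta * g_var)                   # ascend the penalized objective

print("final action probabilities:", softmax(theta))
```

With this penalty the learned policy shifts toward the low-variance arm even though its mean reward is slightly lower, which is the qualitative behaviour the variance-related criterion is meant to capture.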
Keywords/Search Tags: reinforcement learning, variance related risk criteria, policy gradient, temporal difference, POMDP