
Sample Efficiency In Reinforcement Learning

Posted on: 2019-07-12 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: L P Zhang | Full Text: PDF
GTID: 1368330551956860 | Subject: Computer application technology
Abstract/Summary:
Reinforcement learning (RL) is one of the main branches of machine learning. It mainly concerns how to find an optimal control policy from interactions with the learning environment. State-of-the-art RL algorithms usually need a large amount of data to achieve good performance, which limits their use in applications where data is expensive. To reduce the amount of data RL algorithms need, a deeper understanding of their sample efficiency is necessary. Recent theoretical studies give some hints about the relationships between algorithms, problem instances, and sample efficiency. However, these analyses focus too much on the hardest problem instances, and therefore cannot precisely predict the sample efficiency in average cases. As a result, practitioners and researchers cannot use these theoretical results to compare or choose algorithms, set their parameters, or further improve them. This dissertation aims to obtain more accurate and precise theoretical sample-efficiency results by two means: improving existing analysis methods and proposing new ones.

In our first work, we propose a stopping-time sample complexity analysis based on the existing PAC-MDP analysis framework, so that the results better reflect the dynamic properties of the problems. We then propose the Increasingly Cautious Optimism (ICO) principle and apply it to two existing PAC-MDP algorithms to improve their actual sample efficiency. Applying our stopping-time sample complexity analysis to the ICO algorithms shows that, on problem instances other than the hardest ones, the ICO versions have better theoretical sample efficiency than the originals. Empirical results show that the ICO algorithms also achieve better actual sample efficiency, indicating that our improvement to the PAC-MDP analysis is helpful for comparing and improving RL algorithms.

In our second work, we propose RL success probability analysis, which describes the relation between algorithms, problem instances, and sample efficiency directly through mathematical expressions. In this way, the analysis can tell exactly what effect tuning the parameters of an RL algorithm will have. We present a detailed analysis of the probability that a prototype algorithm, executed on chain MDPs, discovers a desired policy at the end of learning. The success probability is derived as a function of the environment dynamics and the algorithm parameters, and we provide a log-normal approximation to it to simplify the computation. Experiments show that our analysis predicts the actual sample efficiency on chain MDPs and maze MDPs with high accuracy and precision.
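As a rough illustration of what a success-probability analysis measures, the sketch below estimates such a probability empirically: it repeatedly runs a simple epsilon-greedy Q-learning agent (a stand-in for the prototype algorithm, whose exact form is not reproduced here) on a small chain MDP and counts how often the greedy policy at the end of the learning budget is the desired "always advance" policy. The chain layout, reward values, and learner parameters are assumptions made for this sketch, not the dissertation's actual setup.

```python
# Illustrative Monte Carlo estimate of a learner's "success probability" on a
# chain MDP: the probability that, after a fixed learning budget, the greedy
# policy it returns is the desired one (always "advance").
# The MDP layout, rewards, and the epsilon-greedy Q-learning agent are
# assumptions made for this sketch only.
import numpy as np

N_STATES = 5          # chain states 0 .. N_STATES-1
P_ADVANCE = 0.8       # probability that "advance" actually moves right
SMALL_REWARD = 0.1    # immediate reward for "retreat" (back to state 0)
GOAL_REWARD = 1.0     # reward for "advance" taken in the last state
GAMMA = 0.95

def step(state, action, rng):
    """One transition of the chain MDP. Action 0 = retreat, 1 = advance."""
    if action == 0:
        return 0, SMALL_REWARD
    if state == N_STATES - 1:
        return state, GOAL_REWARD                  # last state loops on itself
    nxt = state + 1 if rng.random() < P_ADVANCE else state
    return nxt, 0.0

def run_learner(n_steps, epsilon, alpha, rng):
    """Epsilon-greedy Q-learning for a fixed budget; returns the greedy policy."""
    q = np.zeros((N_STATES, 2))
    state = 0
    for _ in range(n_steps):
        action = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q[state]))
        nxt, reward = step(state, action, rng)
        q[state, action] += alpha * (reward + GAMMA * q[nxt].max() - q[state, action])
        state = nxt
    return q.argmax(axis=1)

def success_probability(n_steps, n_trials=2000, epsilon=0.1, alpha=0.1, seed=0):
    """Fraction of independent runs whose final greedy policy always advances."""
    rng = np.random.default_rng(seed)
    desired = np.ones(N_STATES, dtype=int)         # "advance" in every state
    hits = sum(np.array_equal(run_learner(n_steps, epsilon, alpha, rng), desired)
               for _ in range(n_trials))
    return hits / n_trials

if __name__ == "__main__":
    for budget in (200, 1000, 5000):
        print(budget, success_probability(budget))
```

Sweeping the step budget traces an empirical curve of success probability against sample size, which is the kind of quantity an analytical expression, or its log-normal approximation, would aim to predict.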
In our third work, we take a deeper look at the factors that make problem instances differ in difficulty in terms of sample efficiency, and find that the skewness of estimated values is one of the key factors. Generalising results from our second work, we show that an estimated state value can be decomposed into a weighted sum of path-wise state values, and that the latter follow log-normal distributions, which are positively skewed. The distributions of estimated values are therefore convolutions of positively skewed log-normal distributions and negatively skewed flipped log-normal distributions, and can themselves be skewed in either direction, depending on the impacts of positive and negative rewards. Positively skewed estimated values tend to be underestimated, while negatively skewed ones tend to be overestimated, which makes it difficult to decide which actions have higher true values. We derive expressions for the direction and scale of these skews as functions of the environment dynamics and sample sizes, and propose several countermeasures to reduce the disruption the skews cause.
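To make the skewness claim concrete, the following sketch (an illustrative numerical check, not the dissertation's derivation) builds a path-wise value estimate as a reward r times a product of H independently estimated step probabilities. The product is approximately log-normal, so for a positive reward the estimate is positively skewed and its median sits below the true value (typical underestimation), while a negative reward flips the skew and the typical estimate sits above the true value. The chain length, transition probability, and sample size are arbitrary choices.

```python
# Illustrative check of the skewness of estimated path-wise values.
# A path-wise value here is r * p^H: the reward r at the end of a length-H path
# whose every step succeeds with probability p. It is estimated by multiplying
# H independent empirical estimates of p, so the estimate is approximately
# log-normal. All numbers below are arbitrary illustration choices.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

p, H, n = 0.8, 10, 50            # step success probability, path length, samples per step
n_resamples = 100_000

# One empirical probability per step, each from its own n Bernoulli(p) samples.
p_hat = rng.binomial(n, p, size=(n_resamples, H)) / n
prod = p_hat.prod(axis=1)        # estimated probability of completing the whole path

for r in (+1.0, -1.0):           # positive vs. negative reward at the path's end
    v_true = r * p ** H
    v_hat = r * prod             # plug-in estimate of the path-wise value
    print(f"r={r:+.0f}  true={v_true:+.4f}  mean={v_hat.mean():+.4f}  "
          f"median={np.median(v_hat):+.4f}  skew={skew(v_hat):+.2f}")
```

The estimate is mean-unbiased here (the step estimates are independent), so the gap between the median and the true value, together with the sign of the printed skew, is what drives the typical under- or overestimation described above.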
Keywords/Search Tags: Machine learning, reinforcement learning, sample efficiency, exploration strategies, PAC analysis