
Incremental least-squares temporal difference learning

Posted on: 2008-03-27
Degree: M.Sc
Type: Dissertation
University: University of Alberta (Canada)
Candidate: Geramifard, Alborz
Full Text: PDF
GTID: 1448390005459252
Subject: Computer Science
Abstract/Summary:
Sequential decision making is a challenging problem for the artificial intelligence community. It can be modeled as an agent interacting with an environment according to its policy. Policy iteration methods are a popular approach that interleaves two stages: policy evaluation, which computes the desirability of each state under the current policy, and policy improvement, which improves the policy with respect to the state values. The effectiveness of this approach depends heavily on the effectiveness of policy evaluation, which is the focus of this dissertation. The per time step complexity of traditional methods like temporal difference learning (TD) is sublinear in the number of features. Thus, they can be scaled to large environments; however, they use training data relatively inefficiently and so require a large number of sample interactions. The least-squares TD (LSTD) method addresses the data inefficiency of TD by using the sum of the TD updates over all past experiences. This makes LSTD a formidable algorithm for tackling problems where data is limited or expensive to gather. However, the computational cost of LSTD limits its applicability in most large environments. We introduce an incremental version of the LSTD method, called iLSTD, for online policy evaluation in large problems.

On each time step, iLSTD uses the summed TD update vector in a gradient fashion, selecting and descending in a limited set of dimensions. We show that if a sparse feature representation is used, the iLSTD algorithm's per time step complexity is linear in the number of features, whereas for LSTD it is quadratic. This allows iLSTD to scale up to large environments with many features where LSTD cannot be applied. At the same time, because iLSTD takes advantage of all past data on each time step, it requires far less data than the TD method.
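The update described above can be sketched as follows. This is a dense, illustrative sketch (all names and the recomputation of the update vector are this sketch's choices, not the dissertation's): it accumulates the LSTD statistics A and b, forms the summed TD update vector mu = b - A·theta, and descends only in the m dimensions of mu with the largest magnitude. The linear per time step cost claimed in the text relies on sparse features and an incrementally maintained mu, both omitted here for clarity.

```python
import numpy as np

def ilstd_step(A, b, theta, phi, phi_next, reward,
               gamma=0.95, alpha=0.01, m=1):
    """One illustrative iLSTD-style update (dense sketch).

    A and b accumulate the LSTD statistics over all transitions;
    mu = b - A @ theta is the sum of TD updates on past experience.
    Instead of solving for theta as LSTD does, we descend in the
    m largest-magnitude dimensions of mu.
    """
    # Accumulate LSTD statistics for the new transition
    A += np.outer(phi, phi - gamma * phi_next)
    b += reward * phi

    # Summed TD update vector for the current parameters
    mu = b - A @ theta

    # Descend only in the m dimensions with largest |mu_i|
    for i in np.argsort(-np.abs(mu))[:m]:
        step = alpha * mu[i]
        theta[i] += step
        mu -= step * A[:, i]  # keep mu consistent with new theta
    return A, b, theta
```

With sparse features, the statistics updates touch only the nonzero entries of phi, which is what brings the per step cost down to linear in practice.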
Empirical results in the Boyan chain and mountain car environments show the data-efficiency advantage of iLSTD over TD and its speed advantage over LSTD. We also extend iLSTD with eligibility traces, resulting in iLSTD(λ), and show that the additional computation does not change the linear per time step complexity. Additionally, we investigate the performance and convergence properties of iLSTD with different dimension selection mechanisms. Finally, we discuss the limitations of this study.
Keywords/Search Tags: LSTD, iLSTD, per time step complexity, policy