
Incremental least-squares temporal difference learning

Posted on: 2008-03-27
Degree: M.Sc
Type: Dissertation
University: University of Alberta (Canada)
Candidate: Geramifard, Alborz
Full Text: PDF
GTID: 1448390005459252
Subject: Computer Science
Abstract/Summary:
Sequential decision making is a challenging problem for the artificial intelligence community. It can be modeled as an agent interacting with an environment according to its policy. Policy iteration methods are a popular approach that interleaves two stages: policy evaluation, which computes the desirability of each state under the current policy, and policy improvement, which improves the policy with respect to the state values. The effectiveness of this approach depends heavily on the effectiveness of policy evaluation, which is the focus of this dissertation. The per time step complexity of traditional methods like temporal difference learning (TD) is sublinear in the number of features. Thus, they can be scaled to large environments; however, they use training data relatively inefficiently and so require a large number of sample interactions. The least-squares TD (LSTD) method addresses the data inefficiency of TD by using the sum of the TD updates over all past experiences. This makes LSTD a formidable algorithm for tackling problems where data is limited or expensive to gather. However, the computational cost of LSTD limits its applicability in most large environments. We introduce an incremental version of the LSTD method, called iLSTD, for online policy evaluation in large problems.

On each time step, iLSTD uses the summed TD update vector in a gradient fashion, selecting and descending in a limited set of dimensions. We show that if a sparse feature representation is used, the iLSTD algorithm's per time step complexity is linear in the number of features, whereas for LSTD it is quadratic. This allows iLSTD to scale up to large environments with many features where LSTD cannot be applied. At the same time, because iLSTD takes advantage of all past data on each time step, it requires far less data than the TD method.
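The update described above can be sketched as follows. This is a dense, illustrative sketch (all names and the recomputation of the update vector are this sketch's choices, not the dissertation's): it accumulates the LSTD statistics A and b, forms the summed TD update vector mu = b - A·theta, and descends only in the m dimensions of mu with the largest magnitude. The linear per time step cost claimed in the text relies on sparse features and an incrementally maintained mu, both omitted here for clarity.

```python
import numpy as np

def ilstd_step(A, b, theta, phi, phi_next, reward,
               gamma=0.95, alpha=0.01, m=1):
    """One illustrative iLSTD-style update (dense sketch).

    A and b accumulate the LSTD statistics over all transitions;
    mu = b - A @ theta is the sum of TD updates on past experience.
    Instead of solving for theta as LSTD does, we descend in the
    m largest-magnitude dimensions of mu.
    """
    # Accumulate LSTD statistics for the new transition
    A += np.outer(phi, phi - gamma * phi_next)
    b += reward * phi

    # Summed TD update vector for the current parameters
    mu = b - A @ theta

    # Descend only in the m dimensions with largest |mu_i|
    for i in np.argsort(-np.abs(mu))[:m]:
        step = alpha * mu[i]
        theta[i] += step
        mu -= step * A[:, i]  # keep mu consistent with new theta
    return A, b, theta
```

With sparse features, the statistics updates touch only the nonzero entries of phi, which is what brings the per step cost down to linear in practice.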
Empirical results in the Boyan chain and mountain car environments show the data-efficiency advantage of iLSTD over TD and its speed advantage over LSTD. We also extend iLSTD with eligibility traces, resulting in iLSTD(λ), and show that the additional computation does not change the linear per time step complexity. Additionally, we investigate the performance and convergence properties of iLSTD with different dimension selection mechanisms. Finally, we discuss the limitations of this study.
Keywords/Search Tags: LSTD, iLSTD, per time step complexity, policy