
Theories, Algorithms And Applications Of Policy Gradient Reinforcement Learning

Posted on: 2007-11-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X N Wang
Full Text: PDF
GTID: 1118360215970505
Subject: Control Science and Engineering
Abstract/Summary:
In recent years, reinforcement learning (RL) has been one of the key research areas in artificial intelligence and machine learning. Reinforcement learning differs from supervised learning in that teacher signals are not required: a reinforcement learning system learns by interacting with its environment so as to maximize the evaluative feedback it receives. Reinforcement learning methods therefore apply widely to complex optimization and decision problems in which teacher signals are hard to obtain.

As an important branch of reinforcement learning, policy gradient reinforcement learning overcomes some limitations of value-function-based algorithms, including the inability to guarantee convergence and the difficulty of incorporating a priori knowledge. On the other hand, the variance of the policy gradient estimate in existing policy gradient algorithms is usually large, so convergence becomes very slow; this is a major obstacle to the wide application of policy gradient algorithms. The research in this dissertation, supported by the National Natural Science Foundation of China (NSFC) under grant No. 60234030, "Research on theory and methods of mobile robot navigation and control in unknown environments", therefore focuses on the theory and algorithms of policy gradient reinforcement learning and on their application to the motion control of lunar rovers.

Based on an analysis of the theoretical framework of policy gradient reinforcement learning, two methods are studied to increase the convergence speed of previous algorithms: the reward-baseline method, and the incorporation of a priori knowledge into policy gradient algorithms. The reward-baseline method efficiently reduces the variance of the gradient estimate; incorporating prior knowledge greatly increases the convergence speed and also overcomes the drawbacks of random initial policies during the initial learning phase. Furthermore, a new adaptive control method based on reinforcement learning is presented for the multi-wheel coordination control of lunar rovers.

The main contributions of this dissertation are as follows:

1. In the study of the theoretical framework of policy gradient reinforcement learning, it is proved that the gradient estimation formulas of the existing policy gradient algorithms can be unified. Within this framework, several current policy gradient algorithms are generalized.

2. The application of reward-baseline methods to policy gradient reinforcement learning for POMDPs is studied. A method for computing an optimal reward baseline that minimizes the variance of the gradient estimate is presented and proved theoretically, and a new policy gradient algorithm with a reward baseline, Istate-Grbp, is proposed for solving POMDP problems. Introducing reward baselines reduces the variance of the Istate-Grbp algorithm, and the experimental results show that reducing the variance greatly increases the convergence speed of policy gradient algorithms.
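To make the reward-baseline idea of contribution 2 concrete, the following is a minimal sketch, not the dissertation's Istate-Grbp algorithm: subtracting a constant baseline b from the return leaves the score-function gradient estimate (R - b) * grad log pi(a) unbiased but can sharply reduce its variance. The two-armed bandit, the noise level, and the crude mean-reward baseline below are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # fixed softmax preferences for 2 actions
true_reward = np.array([1.0, 0.0])       # expected reward of each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """Score function of a softmax policy: e_a - pi(theta)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def gradient_samples(baseline, n=5000):
    """Draw n single-sample policy-gradient estimates (R - b) * grad log pi."""
    samples = np.empty((n, 2))
    for i in range(n):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        r = true_reward[a] + rng.normal(0.0, 0.5)   # noisy reward signal
        samples[i] = (r - baseline) * grad_log_pi(theta, a)
    return samples

# Compare the two estimators at the same fixed policy: equal means
# (both unbiased), but the baselined estimate has lower variance.
no_base = gradient_samples(baseline=0.0)
with_base = gradient_samples(baseline=true_reward.mean())

print("mean gradient, no baseline:  ", no_base.mean(axis=0))
print("mean gradient, with baseline:", with_base.mean(axis=0))
print("variance, no baseline:  ", no_base.var(axis=0))
print("variance, with baseline:", with_base.var(axis=0))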
3. Fuzzy policy gradient reinforcement learning, which incorporates a priori knowledge through fuzzy inference systems, is studied. Two fuzzy policy gradient reinforcement learning algorithms are proposed for Markov decision processes with discrete and continuous actions, respectively. In both algorithms, the conclusion parameters of the fuzzy rules, which are difficult to specify by hand, are tuned with policy gradient methods. The convergence of the algorithms is proved, and the experimental results show their efficiency.

4. A hybrid policy gradient reinforcement learning method combined with support vector machines (PG-SVM) is proposed to incorporate a priori knowledge. The PG-SVM algorithms use SVMs for initial policy learning and approximation within a policy gradient learning framework, which has not been studied in previous work; a sketch of this two-stage idea follows the contribution list. By using SVM-based policies as the initial policies of policy gradient algorithms, prior knowledge is incorporated automatically through the training data. The resulting learning control approach has three advantages over existing policy gradient algorithms: (1) prior knowledge can be used simply by providing training examples to SVM-based supervised learning; (2) the controller performance can be optimized by online policy gradient RL to compensate for unknown disturbances; (3) the controller structure is determined by the SVMs and is therefore data-driven rather than predefined.

5. For the multi-wheel coordination problem in the motion control of lunar rovers, an adaptive control method based on hybrid policy gradient reinforcement learning is proposed. Because of the structural complexity of a lunar rover, classical control methods have several disadvantages, and on-line terrain parameter estimation is also required. A hybrid policy gradient reinforcement learning control method is proposed for this complex optimal control problem, in which teacher signals are hard to obtain and fuzzy rules are hard to design. The problem has a high-dimensional continuous state space and a continuous action space, so previous RL algorithms are very time-consuming and need simulation environments, which in turn require a dynamical model and a virtual environment for the lunar rover. By incorporating prior information from training data into the reinforcement learning control method, the learning time is shortened greatly and on-line performance is guaranteed; the learning process can therefore be carried out entirely on the real lunar rover, without the help of any simulation environment. This is significant progress toward the practical application of reinforcement learning, and the performance of the resulting controller is satisfactory.

The directions for future research are discussed in the last chapter.
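The sketch below illustrates the two-stage structure described in contribution 4: an SVM is first fitted to teacher-provided (state, action) examples, and its decision value then parameterizes a stochastic policy that REINFORCE-style updates refine online. This is an illustrative reconstruction under stated assumptions, not the dissertation's PG-SVM algorithm; scikit-learn's SVC, the toy task, and all parameter names are assumptions of the sketch.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# --- Stage 1: prior knowledge as training data for supervised learning ---
# Toy task: the teacher chooses action 1 iff the first state feature is positive.
X_demo = rng.normal(size=(200, 2))               # demonstrated states
y_demo = (X_demo[:, 0] > 0).astype(int)          # teacher's actions
svm = SVC(kernel="rbf").fit(X_demo, y_demo)

# --- Stage 2: SVM decision values define the initial stochastic policy ---
# pi(a=1|s) = sigmoid(w * f(s) + b), where f(s) is the SVM margin and
# (w, b) are the parameters that policy-gradient learning will tune.
w, b = 1.0, 0.0

def pi1(s):
    return 1.0 / (1.0 + np.exp(-(w * svm.decision_function([s])[0] + b)))

def reward(s, a):
    return 1.0 if a == int(s[0] > 0) else 0.0    # unknown to the learner

# --- Stage 3: REINFORCE-style online refinement of (w, b) ---
alpha = 0.1
for _ in range(500):
    s = rng.normal(size=2)
    p = pi1(s)
    a = int(rng.random() < p)                    # sample an action
    r = reward(s, a)
    f = svm.decision_function([s])[0]
    # grad log pi for a Bernoulli(sigmoid) policy is (a - p) * dz/dparam,
    # with z = w * f(s) + b, so dz/dw = f(s) and dz/db = 1.
    w += alpha * r * (a - p) * f
    b += alpha * r * (a - p)

print("tuned policy parameters:", w, b)

Because the SVM already encodes the demonstrated behavior, the policy starts near a sensible controller instead of a random one, which is the initialization advantage the abstract attributes to PG-SVM.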
Keywords/Search Tags: Reinforcement Learning, Policy Gradient, Policy Search, Machine Learning, Markov Decision Processes, Lunar Rover, Partially Observable Markov Decision Processes, Prior Knowledge, Multi-wheel Coordination