
Abstraction-Based Reinforcement Learning Algorithms and Their Quantization

Posted on: 2022-12-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X C Zhu
Full Text: PDF
GTID: 1480306764960359
Subject: Automation Technology
Abstract/Summary:
Reinforcement learning (RL) defines the problems faced by agents that learn to make good decisions through actions and observations alone. To be effective problem solvers, such agents must efficiently exploit limited data, computational resources, and perceptual bandwidth to explore complex environments, assign credit under delayed feedback, and generalize from new experiences. For all of these tasks, abstraction is essential: through abstraction, agents can form compact models of the environment that meet the many practical needs of a rational, adaptive decision-maker. Recently, quantum computation techniques have been successfully applied in the field of RL. By exploiting quantum superposition and entanglement, quantum reinforcement learning (QRL) searches larger spaces, learns faster, and strikes a better balance between exploration and exploitation than traditional RL. This dissertation proposes new abstraction-based classical RL algorithms and QRL algorithms. By learning abstractions, these algorithms help agents explore the environment faster, learn policies more accurately, and extrapolate better to new tasks. The main contents and innovations of this dissertation are as follows:

This dissertation proposes the Minimum Degree and Maximum Distance (MDMD) action-abstraction generation method, which accelerates agent exploration in sparse-reward domains by reducing the expected cover time of the environment. Specifically, the method heuristically selects, as the endpoints of an action abstraction, two non-adjacent vertices of the state-transition graph that have the minimum degree and the maximum distance between them (a sketch of this heuristic appears below). The action abstractions generated this way achieve a lower expected cover time than those produced by other generation methods, and therefore accelerate exploration in sparse-reward domains more effectively. Experimental results show that this method outperforms other action-abstraction generation methods in six challenging sparse-reward environments.

This dissertation proposes the Wasserstein Deterministic Information Bottleneck State Abstraction (WDIBS) method to address the trade-off between state compression and decision performance. The difference between the state-compressed policy and the expert policy is measured with the Wasserstein distance: even when the two distributions do not have overlapping supports, the Wasserstein distance still reflects their actual difference (illustrated below), which ensures that WDIBS retains good decision performance at low information rates. Theoretical analysis and experiments demonstrate that this method balances state compression against decision performance better than previous methods.

This dissertation proposes a quantum episodic memory deep Q-network method, which uses episodic memory to accelerate the training of quantum agents while using state abstraction to compress the original state space. Specifically, the model records high-reward historical experiences in an episodic memory so that, when the current environment state is similar to a stored state, the quantum agent can quickly retrieve the desired action from history (sketched below), reducing the number of iterations needed to optimize the algorithm. Numerical simulations on five classic Atari games show that this method achieves higher scores and lower running time than other QRL methods.
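A minimal sketch of the MDMD selection heuristic described above, using networkx. The function name, the objective ordering (minimum combined degree first, maximum distance as tie-breaker), and the tie-breaking are illustrative assumptions; this abstract does not specify the dissertation's exact formulation.

```python
# Hypothetical sketch of MDMD pair selection on a state-transition graph.
import itertools
import networkx as nx

def mdmd_pair(G: nx.Graph):
    """Pick two non-adjacent vertices with small degree and large
    shortest-path distance, to serve as the endpoints of an action
    abstraction intended to shorten the expected cover time."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    best, best_key = None, None
    for u, v in itertools.combinations(G.nodes, 2):
        if G.has_edge(u, v) or v not in dist[u]:
            continue  # skip adjacent or unreachable pairs
        # Favor minimum combined degree, then maximum distance.
        key = (G.degree[u] + G.degree[v], -dist[u][v])
        if best_key is None or key < best_key:
            best, best_key = (u, v), key
    return best

# Example: the transition graph of a 10x10 grid world.
G = nx.grid_2d_graph(10, 10)
print(mdmd_pair(G))  # two opposite, low-degree corner states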
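A toy illustration of the property WDIBS relies on: when two distributions have disjoint supports, the KL divergence degenerates to infinity, while the Wasserstein distance still reports a meaningful difference. The 1-D distributions below are invented for the demonstration and are not the dissertation's actual objective.

```python
import numpy as np
from scipy.stats import wasserstein_distance, entropy

bins = np.arange(6)
p = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])  # "compressed" policy (toy)
q = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5])  # "expert" policy (toy)

print(entropy(p, q))                           # inf: KL blows up on disjoint supports
print(wasserstein_distance(bins, bins, p, q))  # 4.0: the distance mass must travel
```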
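A hypothetical sketch of the episodic-memory lookup: high-return experiences are stored keyed by their abstracted states, and a sufficiently similar current state reuses the remembered action instead of waiting for slow value-function updates. All names and the distance threshold are illustrative assumptions, not the dissertation's API.

```python
import numpy as np

class EpisodicMemory:
    def __init__(self, threshold: float = 0.1):
        self.keys, self.actions = [], []
        self.threshold = threshold

    def store(self, abstract_state: np.ndarray, action: int):
        """Record a high-return experience keyed by its abstracted state."""
        self.keys.append(abstract_state)
        self.actions.append(action)

    def recall(self, abstract_state: np.ndarray):
        """Return the remembered action of the nearest stored state if it
        is close enough; otherwise signal a fall-back to the Q-network."""
        if not self.keys:
            return None
        dists = [np.linalg.norm(abstract_state - k) for k in self.keys]
        i = int(np.argmin(dists))
        return self.actions[i] if dists[i] < self.threshold else None
```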
This dissertation proposes the Gradient Penalty based Wasserstein Adversarial Proximal Policy Optimization (GPWAPPO) method, which uses the Proximal Policy Optimization (PPO) algorithm to learn to match state abstractions between the source domain and the target domain, achieving visual transfer for RL. Notably, to strengthen the Lipschitz constraint, this method introduces a penalty on the gradient norm taken with respect to inputs interpolated between the source task and the target task (sketched below), which enhances the algorithm's stability. Experiments in Visual Cartpole and 16 OpenAI Procgen environments show that the proposed method outperforms previous methods.

This dissertation presents a Trust Region-based PPO with Rollback for quantum architecture search (QAS-TR-PPO-RB) method, which can automatically construct quantum circuit architectures using only a small amount of physical knowledge. Specifically, the proposed method employs an improved clipping function that implements a rollback behavior to limit the probability ratio between the new and old policies (sketched below). Moreover, QAS-TR-PPO-RB triggers the clipping according to a trust-region condition, restricting the policy to the trust region and thereby guaranteeing monotonic improvement. Experiments on several multi-qubit circuits demonstrate that this method achieves better policy performance and lower algorithm running time than other methods.
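A sketch of the kind of gradient penalty described above, following the standard WGAN-GP recipe of penalizing the critic's gradient norm on interpolations between source- and target-domain state encodings. The `critic` interface and batch layout are assumptions; only the penalty itself follows the standard gradient-penalty formula.

```python
import torch

def gradient_penalty(critic, src_states, tgt_states):
    # Random interpolation between source and target encodings (B, D).
    eps = torch.rand(src_states.size(0), 1, device=src_states.device)
    mixed = (eps * src_states + (1 - eps) * tgt_states).requires_grad_(True)
    scores = critic(mixed)
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=mixed, create_graph=True)
    # Penalize deviation of the gradient norm from 1 (Lipschitz-1 target).
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```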
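A sketch of a clipped surrogate with rollback, based on the published TR-PPO-RB formulation; whether the dissertation uses exactly these coefficients, and the precise form of its trust-region trigger (omitted here), are not specified in this abstract.

```python
import torch

def ppo_rollback_loss(ratio, advantage, eps=0.2, alpha=0.3):
    # Standard PPO term inside the clipping range [1-eps, 1+eps].
    surrogate = ratio * advantage
    # Rollback terms: outside the range the slope in the ratio is
    # reversed (scaled by alpha), continuous at the range boundaries.
    rollback_up = -alpha * ratio + (1 + alpha) * (1 + eps)
    rollback_dn = -alpha * ratio + (1 + alpha) * (1 - eps)
    clipped = torch.where(ratio > 1 + eps, rollback_up * advantage,
              torch.where(ratio < 1 - eps, rollback_dn * advantage, surrogate))
    return -torch.min(surrogate, clipped).mean()
```

The design point is that, outside the clipping range, the surrogate's slope becomes negative in the ratio, so gradient ascent actively pulls the new policy back toward the old one instead of merely no longer rewarding it for leaving the region.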
Keywords/Search Tags:Reinforcement Learning, Quantum Reinforcement Learning, State Abstraction, Action Abstraction, Quantum Computation