
Research On Reinforcement Learning Algorithm Based On Parallel Sampling And Behavior Induction

Posted on: 2024-02-28    Degree: Master    Type: Thesis
Country: China    Candidate: K L Zeng    Full Text: PDF
GTID: 2568306914972529    Subject: Control Science and Engineering
Abstract/Summary:
Reinforcement learning has recently achieved considerable success in a variety of fields, demonstrating its enormous potential for handling sequential decision problems. The goal of reinforcement learning is to gather experience in new environments and thereby acquire efficient policies for completing tasks. Owing to the coupled effects of factors such as the high-dimensional state space of the interactive environment, environmental noise, and sparse rewards, it is challenging for agents to learn optimal policies in complex decision-making and control tasks. To address this, a reinforcement learning method is developed that combines parallel sampling and behavior induction with environmental prior knowledge. The specific research content and results are as follows.

(1) A hopper experiment based on the DeepMind Control Suite is built, and exploration methods such as the Intrinsic Curiosity Module (ICM), the Active Pre-Training (APT) algorithm, and the State Marginal Matching (SMM) algorithm are implemented. The strengths and weaknesses of these algorithms are assessed through analysis of the experimental results, which provides a theoretical foundation and experimental support for the subsequent work.

(2) To address the low sampling efficiency of state-action pairs in continuous-action environments, a reinforcement learning exploration algorithm based on particle entropy and behavior induction (Active Pre-training algorithm for Diverse behavior induction, APD) is proposed. The state encoder is improved with a behavior-contrastive representation loss that maps high-dimensional states into a latent space. A particle entropy estimator is introduced to optimize the mutual information objective and is combined with a behavior update mechanism for state sampling to construct a dynamic intrinsic reward, thereby improving exploration of unknown environments (both components are sketched after the abstract). The results demonstrate that exploration efficiency increases significantly and that the cumulative reward of this method is at least 28% higher than that of previous algorithms in the Jaco Arm environment.

(3) To address the low robustness caused by an unstable behavior-label update mechanism during fine-tuning on downstream tasks, an exploration optimization algorithm based on a variational autoencoder (VAE) and parallel sampling is proposed. A behavior update method is designed using parallel sampling and temporal peak detection, which comprehensively determines the behavior labels of unexplored regions. To increase the effectiveness of behavior policies, a VAE encoder is adopted to probabilistically encode action intents and to enhance the particle-entropy-based behavior induction model (a sketch also follows the abstract). Experimental results show that the proposed algorithm further improves performance, with a cumulative reward at least 33% higher than that of the other algorithms in the Jaco Arm environment.
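The abstract does not give the exact form of the particle entropy reward in (2), but APD builds on the APT family, where the intrinsic reward is typically a k-nearest-neighbor particle estimate of state entropy in the latent space. A minimal PyTorch sketch of such an estimator follows; the function name, the choice k=12, and the purely batch-wise estimation are illustrative assumptions, not the thesis's exact implementation.

```python
import torch

def particle_entropy_reward(z, k=12, eps=1e-6):
    """k-nearest-neighbor particle entropy estimate used as an
    intrinsic reward: states whose latent embeddings lie far from
    their neighbors (novel regions) receive larger rewards.

    z: (batch, dim) latent state embeddings from the encoder.
    Returns a (batch,) tensor of intrinsic rewards.
    """
    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = torch.cdist(z, z)                       # (batch, batch)
    # The k+1 smallest distances include the zero self-distance; drop it.
    knn_dist, _ = dist.topk(k + 1, largest=False)  # (batch, k+1)
    knn_dist = knn_dist[:, 1:]                     # (batch, k)
    # The log of the average distance to the k nearest particles
    # approximates each state's local entropy contribution
    # (up to additive constants).
    return torch.log(knn_dist.mean(dim=1) + eps)
```

In this sketch the embeddings z would come from the behavior-induced state encoder, and the resulting reward would be added to the learning signal during pre-training.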
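Likewise, the behavior-contrastive representation loss used to train the state encoder is not spelled out in the abstract; an InfoNCE-style contrastive objective is one standard realization. The sketch below assumes paired embeddings of states from the same behavior as positives, with the rest of the batch serving as negatives; the function name and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def behavior_contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style contrastive loss: embeddings of states generated
    by the same behavior are pulled together, while other states in
    the batch act as negatives.

    anchor, positive: (batch, dim) latent embeddings.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature   # (batch, batch) cosine similarities
    # Diagonal entries pair each anchor with its own positive.
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)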
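For contribution (3), the abstract states only that a VAE encoder probabilistically encodes action intents. A minimal sketch of such an encoder, assuming the intent is inferred from a short action sequence and regularized toward a standard Gaussian prior, is given below; the class name, architecture, and dimensions are all illustrative.

```python
import torch
import torch.nn as nn

class IntentVAE(nn.Module):
    """Sketch of a VAE encoder mapping an action sequence to a
    Gaussian posterior over a latent 'intent' code."""
    def __init__(self, action_dim, horizon, latent_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(action_dim * horizon, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, actions):  # actions: (batch, horizon, action_dim)
        h = self.enc(actions.flatten(1))
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        # KL term regularizing the intent posterior toward N(0, I).
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
        return z, kl
```

Under this reading, the sampled intent code z would condition or filter the behavior induction model, while the KL term keeps the intent distribution well behaved across parallel samplers.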
Keywords/Search Tags: reinforcement learning, exploration strategy, intrinsic reward, mutual information, particle entropy