
Research On Off-policy Reinforcement Learning Algorithm

Posted on: 2022-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: C Ma
Full Text: PDF
GTID: 2518306602955439
Subject: Control Engineering
Abstract/Summary:
The ultimate goal of machine learning is to realize artificial intelligence in the true sense, but at present neither supervised learning, unsupervised learning, nor reinforcement learning has achieved this. We hope to build an artificial agent that can perceive, think, and make decisions, and reinforcement learning comes closest of the three to this vision. Reinforcement learning is an important branch of artificial intelligence; its aim is to let an agent learn, without manual intervention, how to explore its environment and ultimately find an optimal policy through self-learning.

This thesis focuses on off-policy algorithms in reinforcement learning. In an off-policy algorithm, the sampling policy that the agent follows in the environment and the policy being evaluated are not the same: typically one policy is first used to collect a large number of samples, and the algorithm then uses these samples, gathered under a non-optimal policy, to learn the target policy (a minimal sketch of this idea is given after this abstract). This separation offers great advantages in practical applications. The contributions are twofold.

1. We take the basis-function selection problem in least-squares algorithms as our starting point. In most algorithms based on least squares, the basis functions are chosen by experience or by trial and error, yet the type and number of basis functions are critical to how well the final value function can be represented. To address this, we introduce Orthogonal Matching Pursuit (OMP) to screen the basis functions used in reinforcement learning. Exploiting the strengths of OMP, we perform a sparse fit at each iteration of the algorithm with a recent least-squares method that carries a gradient-correction term, combining the two algorithms deeply; we name the result the OMP-TDC algorithm (an illustrative sketch appears below). The thesis evaluates the proposed algorithm in two reinforcement learning benchmark environments, studying both its ability to screen basis functions and its convergence behavior.

2. The thesis then improves the Deep Deterministic Policy Gradient (DDPG) algorithm in deep reinforcement learning. To address the strong coupling between the Actor and the Critic in DDPG and the Critic's tendency to overestimate values, we propose a DoubleNet DDPG algorithm with two Actors and two Critics, together with a relearning framework that targets instability during learning. When the two are combined, an optimal action selection mechanism and an asymmetric delayed soft-update mechanism for the target networks are added (a sketch of these two mechanisms is given below). The thesis evaluates the algorithm in the MuJoCo environments under OpenAI Gym and reports the learning curves; compared with the original algorithm, the proposed algorithm improves the final score and the learning stability to varying degrees.
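To make the off-policy setting concrete, here is a minimal sketch, assuming a tabular problem and a Gym-style `env` with `reset()`/`step()` (all names are illustrative, not from the thesis). Tabular Q-learning is the classic off-policy example: samples are collected with an epsilon-greedy behavior policy, while the update bootstraps from the greedy target policy, so the sampling policy and the evaluated policy differ.

```python
import numpy as np

# Minimal tabular Q-learning sketch of the off-policy idea. `env` is a
# hypothetical Gym-style environment with reset()/step().
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior (sampling) policy: epsilon-greedy exploration.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Target (evaluated) policy: greedy via max over next actions,
            # which differs from the behavior policy -- hence off-policy.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done)
                                - Q[s, a])
            s = s_next
    return Q
```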
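The following sketch illustrates one way the OMP-TDC combination of contribution 1 could look; it is an assumption-laden outline, not the thesis' implementation. `Phi` and `Phi_next` are batch feature matrices for states and successor states, OMP greedily grows an active set of basis functions by correlation with the Bellman residual, and TDC updates (with the gradient-correction weights `w`) are restricted to that active set.

```python
import numpy as np

# Hedged sketch of the OMP-TDC idea (names and batch interface assumed).
def omp_tdc(Phi, Phi_next, rewards, gamma=0.99,
            alpha=0.01, beta=0.005, k=10, sweeps=50):
    """Phi, Phi_next: (N, d) feature matrices for s and s'; k: sparsity."""
    N, d = Phi.shape
    theta = np.zeros(d)   # value-function parameters
    w = np.zeros(d)       # TDC gradient-correction weights
    active = []           # indices of selected basis functions
    for _ in range(k):
        # OMP step: select the basis function most correlated with the
        # current Bellman residual over the whole batch.
        residual = rewards + gamma * Phi_next @ theta - Phi @ theta
        corr = np.abs(Phi.T @ residual)
        corr[active] = -np.inf          # never reselect an active index
        active.append(int(np.argmax(corr)))
        # TDC sweeps restricted to the active set (sparse fit).
        for _ in range(sweeps):
            for i in range(N):
                phi, phi_n = Phi[i, active], Phi_next[i, active]
                delta = (rewards[i] + gamma * phi_n @ theta[active]
                         - phi @ theta[active])
                theta[active] += alpha * (delta * phi
                                          - gamma * phi_n * (phi @ w[active]))
                w[active] += beta * (delta - phi @ w[active]) * phi
    return theta, active
```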
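For contribution 2, the sketch below illustrates two of the described mechanisms under stated assumptions; all module names, rates, and periods are hypothetical placeholders, not the thesis' code: optimal action selection between the two Actors' candidate actions, and asymmetric delayed soft updates of the target networks.

```python
import torch

def select_action(state, actor1, actor2, critic1, critic2):
    """Optimal action selection (assumed design): each Actor proposes an
    action for a single state, each candidate is scored with the minimum
    of the two Critics to damp overestimation, and the higher-scored
    candidate is executed."""
    a1, a2 = actor1(state), actor2(state)
    q1 = torch.min(critic1(state, a1), critic2(state, a1))
    q2 = torch.min(critic1(state, a2), critic2(state, a2))
    return a1 if q1.item() >= q2.item() else a2

def soft_update(target, source, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * source."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1.0 - tau).add_(s, alpha=tau)

def update_targets(step, actors, actor_targets, critics, critic_targets,
                   tau_critic=0.005, tau_actor=0.001, actor_period=2):
    """Asymmetric delayed soft update: Critic targets track every step at
    a faster rate; Actor targets update more slowly and only every
    `actor_period` steps (all rates and periods are assumptions)."""
    for c, ct in zip(critics, critic_targets):
        soft_update(ct, c, tau_critic)
    if step % actor_period == 0:
        for a, at in zip(actors, actor_targets):
            soft_update(at, a, tau_actor)
```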
Keywords/Search Tags: Reinforcement learning, deep reinforcement learning, least squares, orthogonal matching pursuit algorithm, deep deterministic policy gradient algorithm