Fast-PPO: Fast Proximal Policy Optimization

Posted on: 2021-02-13

Degree: Master

Type: Thesis

Country: China

Candidate: Z Xiao

Full Text: PDF

GTID: 2428330620464198

Subject: Computer technology

Abstract/Summary:

At present, most deep reinforcement learning algorithms suffer from low stability and low reproducibility. Some recent methods, such as Proximal Policy Optimization (PPO), maintain stability only by restricting the policy update to a slower pace. In this thesis, we model the problem under the Advantage Actor-Critic (A2C) architecture for further analysis. Members of the policy gradient family generally suffer from unstable policy updates, and they are also quite limited in discrete state spaces. We propose a new algorithm, Fast-PPO, that combines the advantages of PPO with gradient independence and general applicability to discrete state spaces, thereby obtaining better gradient estimates. By using an optimal baseline, the algorithm raises the bound on the return and accelerates convergence. We prove upper and lower bounds on the reward function in Fast-PPO theoretically, and we test the algorithm on currently popular high-dimensional continuous benchmarks to illustrate its advantages.

The extended experiments cover three settings. First, in stable environments, Fast-PPO has a wider range of application than other algorithms. It overcomes the limitation that Q-learning algorithms are used only in discrete spaces and policy gradient (PG) algorithms only in continuous spaces; Fast-PPO shows some advantages in both. Second, in multi-agent environments, Fast-PPO is applied to Tennis and football games to achieve multi-agent control, and the experiments show that it also performs well in multi-agent cooperation and competition. Finally, Fast-PPO is applied to complex environments, such as a corgi fetching sticks and drone control. The corgi task is a lighthearted, everyday example. Drone training is a current focus of military training and is regarded as key to future success in air operations. Among current RL algorithms, Fast-PPO can largely handle the UAV path-planning problem in drone control, which gives it some practical significance.
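For readers unfamiliar with how PPO restricts the policy update, the following is a minimal sketch of the standard PPO clipped surrogate objective combined with a baseline-subtracted advantage. It is not the Fast-PPO method, whose details are not given in this abstract. The function names, the `clip_eps` default of 0.2, and the per-sample baseline values are assumptions chosen for illustration; the baseline here is simply a supplied value estimate, not the optimal baseline the thesis refers to.

```python
import math

def advantages_from_returns(returns, baseline):
    """Advantage estimate: observed return minus a baseline value per sample.

    The baseline values are supplied by the caller (hypothetical here); this
    is not the optimal baseline mentioned in the abstract.
    """
    return [g - b for g, b in zip(returns, baseline)]

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    new_logp / old_logp are log-probabilities of the taken actions under the
    updated and the data-collecting policy, respectively.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

if __name__ == "__main__":
    adv = advantages_from_returns([1.0, 2.0], [0.5, 0.5])
    print(ppo_clipped_objective([0.1, -0.2], [0.0, 0.0], adv))
```

The clipping is what slows the update: once the probability ratio leaves the interval around 1, a sample with positive advantage contributes no further gain, so a single gradient step has little incentive to move the policy far from the one that collected the data.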

Keywords/Search Tags:

Deep reinforcement learning, PPO, Fast-PPO, Policy Gradient Algorithm

Related items

1	Research On Fast Policy Gradient Algorithms Of Reinforcement Learning Based On Adaptive Learning Rate
2	Research On Fast Training Method Of Robotic Arm Based On Deep Reinforcement Learning
3	Robust Policy Gradient Algorithm Based On Actor-Critic In Deep Reinforcement Learning
4	Research On Off-policy Reinforcement Learning Algorithm
5	Deep Reinforcement Learning Based On Policy Gradient Optimization And Its Application In Agent Control
6	Research On Agent Decision-making And Control Based On Deep Reinforcement Learning
7	Deep Deterministic Policy Gradient Based On Entropy Regularization And Regular Update
8	Optimization On Deep Reinforcement Learning Based On Policy Gradient
9	Study Of Robot Arm Control Based On Deep Reinforcement Learning
10	Research On Multiagent Cooperation And Applications Based On Reinforcement Learning