
Fast-PPO: Fast-Proximal Policy Optimization

Posted on: 2021-02-13 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Xiao | Full Text: PDF
GTID: 2428330620464198 | Subject: Computer technology
Abstract/Summary:
At present, most deep reinforcement learning algorithms suffer from low stability and low reproducibility. Some recent methods, such as Proximal Policy Optimization (PPO), maintain stability only by restricting policy updates to a slower pace. In this thesis, we model the problem under the Advantage Actor-Critic (A2C) architecture for further analysis. Members of the policy gradient family generally suffer from unstable policy updates, and they are also very limited in discrete state spaces. We propose a new algorithm, Fast-PPO, that combines the advantages of PPO with gradient independence and generality in discrete state spaces, thereby obtaining better gradient estimates. By using the optimal baseline, the algorithm raises the bound on the return and accelerates convergence. We theoretically prove upper and lower bounds on the reward function in Fast-PPO, and we test the algorithm on currently popular high-dimensional continuous benchmarks to illustrate its superiority.

In the extended experiments, first, in stable environments, Fast-PPO has a wider range of application than other algorithms. It overcomes the limitation that Q-learning-family algorithms are used only in discrete spaces and policy gradient (PG) algorithms only in continuous spaces; Fast-PPO shows advantages in both. Second, for multi-agent environments, Fast-PPO is applied to tennis and football games to achieve multi-agent control, and the experiments show that it also performs well in multi-agent cooperation and competition. Finally, Fast-PPO is applied to complex environments, such as a corgi fetching sticks and drone control. The corgi task is a lighthearted, everyday scenario. Drone training is a current focus of military training and is also key to future success in air operations. Among current RL algorithms, Fast-PPO can largely handle the current UAV path-planning problem in drone control, which gives it practical significance.
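For readers unfamiliar with how PPO restricts policy updates, the following is a minimal sketch of the standard PPO clipped surrogate objective combined with a baseline-subtracted advantage. It is not the Fast-PPO update, which the abstract does not specify. The function names, the `clip_eps=0.2` default, and the use of the mean return as a baseline are illustrative assumptions; in particular, the mean return is only a placeholder and not the optimal baseline the thesis refers to.

```python
import math


def advantages_with_baseline(returns, baseline):
    """Advantage estimate: each return minus a scalar baseline.

    Assumption: a constant baseline, used here only as a placeholder.
    """
    return [g - baseline for g in returns]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of the standard PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old are log-probabilities of the sampled actions under
    the updated policy and the data-collecting policy, respectively.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    returns = [1.0, 3.0]
    baseline = sum(returns) / len(returns)  # placeholder baseline
    adv = advantages_with_baseline(returns, baseline)
    logp_old = [math.log(0.5), math.log(0.5)]
    logp_new = [math.log(0.5), math.log(0.8)]
    print(clipped_surrogate(logp_new, logp_old, adv))
```

In the usage above, the second action's probability ratio is 1.6, so its contribution is capped at 1.2 times its advantage. That cap is what keeps each update small, and it is the source of the slower pace the abstract mentions.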
Keywords/Search Tags: Deep reinforcement learning, PPO, Fast-PPO, Policy Gradient Algorithm