
Research On Reinforcement Learning Algorithm Based On Improved Action Decision Method

Posted on: 2022-12-14    Degree: Master    Type: Thesis
Country: China    Candidate: W H Zhang    Full Text: PDF
GTID: 2518306746968829    Subject: Computer Science and Technology
Abstract/Summary:
A standard deep reinforcement learning algorithm must include an appropriate action decision method. Whether that method can balance exploration and exploitation directly affects the convergence speed of the algorithm and whether the optimal policy can be learned at all. At present, value-based deep reinforcement learning algorithms use the ε-greedy method for action decision, but its excessive randomness leads to slow convergence or prevents the agent from learning the optimal policy. Traditional reinforcement learning offers another effective action decision method, the upper confidence bound (UCB), which adds a confidence bound to each action value and then selects the action with the largest augmented value. The advantage of UCB is that its exploration is not random but rational and strategic, so it outperforms the ε-greedy method. However, estimating the confidence bound span of each action in UCB requires a table counting how many times each action has been selected. For high-dimensional state spaces, such a table is too large to maintain in memory, so UCB cannot handle complex reinforcement learning problems.

This paper proposes an improved action decision method, Deep UCB, for complex reinforcement learning problems. In suitable scenarios it can replace the ε-greedy method for action decision, avoiding the excessive randomness of ε-greedy while maintaining a higher degree of exploration. Deep UCB consists of three modules: a confidence bound span fitting model based on a deep neural network, a confidence bound span balancing model based on reverse-ordered target values, and a dynamic exploration-exploitation balance factor. First, a deep neural network called the UCB network is built to fit the confidence bound span function; the sum of the outputs of the Q network and the UCB network is taken as the UCB value of each action, and the action with the largest UCB value is selected directly. Second, the confidence bound span values output by the UCB network, arranged in reverse order, are used as the training target; back-propagation and gradient descent in the UCB network then balance the span values automatically. Third, a dynamic balance factor that decays with the number of iterations controls the proportion of action value and confidence bound span within the UCB value, balancing exploration and exploitation during training.
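The action decision rule and the reverse-order span update described above can be made concrete in code. The following is a minimal sketch, assuming a PyTorch implementation; all names here (UCBNet, select_action, reverse_order_target, balance) are illustrative assumptions, not the thesis's actual code.

```python
# Minimal sketch of Deep UCB action decision (assumed PyTorch implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UCBNet(nn.Module):
    """Small MLP; the same architecture can serve as the Q network or the UCB network."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(q_net: nn.Module, ucb_net: nn.Module,
                  state: torch.Tensor, balance: float) -> int:
    """Greedy choice on the UCB value: Q(s, a) + balance * span(s, a)."""
    with torch.no_grad():
        ucb_value = q_net(state) + balance * ucb_net(state)
    return int(ucb_value.argmax(dim=-1).item())

def reverse_order_target(spans: torch.Tensor) -> torch.Tensor:
    """Reassign the current span values to actions in reverse rank order:
    the action with the largest span gets the smallest value as its target,
    so gradient descent automatically balances the spans over training."""
    sorted_vals, order = spans.sort(dim=-1, descending=True)
    target = torch.empty_like(spans)
    target.scatter_(-1, order, sorted_vals.flip(dims=[-1]))
    return target.detach()

def ucb_update(ucb_net: nn.Module, opt: torch.optim.Optimizer,
               states: torch.Tensor) -> None:
    """One gradient step pulling the span outputs toward their reversed ordering."""
    spans = ucb_net(states)
    loss = F.mse_loss(spans, reverse_order_target(spans))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a training loop, `balance` would be decayed each iteration (e.g. `balance *= decay_rate`) so that the confidence bound span dominates early exploration and action selection gradually reverts to greedy choice on the Q values alone.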
After the algorithm was designed, the influence of several reinforcement learning hyperparameters on Deep UCB was studied and analyzed, including the two hyperparameters unique to Deep UCB: the UCB network learning rate and the decay rate of the dynamic balance factor. The results show that the UCB network learning rate controls the degree to which the confidence bound span is updated, affecting not only the training speed but also the fluctuation of the reward curve, while the decay rate of the dynamic balance factor affects the balance of exploration and exploitation and thus the training speed.

Then, in multiple experimental environments, the training process and test scores of reinforcement learning algorithms based on Deep UCB and on the ε-greedy method were compared, and the degree of exploration in early training was evaluated. The training process and the evaluation show that Deep UCB maintains a higher degree of exploration than the ε-greedy method, and its reward curve during training is greater than or equal to that of the ε-greedy method. The test results show that the average score of the Deep UCB-based algorithm is higher than that of the ε-greedy-based algorithm. Finally, the effect of Deep UCB was compared across different base algorithms. The results show that combining Nature DQN with Deep UCB greatly improves its ability to solve complex problems, with performance that even exceeds Double DQN. In the experimental environments used in this paper, Deep UCB can therefore replace the ε-greedy method for action decision, solve complex reinforcement learning problems, maintain a higher degree of exploration in the early stage of training, and perform well in both training and testing. The experiments also contribute an analysis of how sensitive current action decision methods are to hyperparameters, base algorithms, and environments. In summary, this research has both theoretical significance and practical value.
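For reference, the tabular UCB rule that Deep UCB generalizes can be set beside the Deep UCB decision rule described above. Here λ stands in for the thesis's balance factor (whose original symbol did not survive extraction), and the exponential decay schedule is an assumption; the abstract states only that the factor decays with the number of iterations.

```latex
% Tabular UCB (UCB1): needs a visit-count table N(a), which is infeasible
% for high-dimensional state spaces.
a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]

% Deep UCB: the span U_\theta(s,a) is fit by the UCB network instead of
% being computed from counts; \lambda_t is the decaying balance factor
% (exponential decay is an assumed form, not stated in the abstract).
a_t = \arg\max_a \left[ Q(s,a) + \lambda_t \, U_\theta(s,a) \right],
\qquad \lambda_t = \lambda_0 \, \gamma^{\,t}, \quad 0 < \gamma < 1
```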
Keywords/Search Tags:UCB, Exploration and Exploitation, Deep Reinforcement Learning, Machine Learning