
Research On Reward Optimization In Reinforcement Learning

Posted on: 2021-10-19  Degree: Master  Type: Thesis
Country: China  Candidate: Z Wang  Full Text: PDF
GTID: 2518306020982809  Subject: Detection Technology and Automation
Abstract/Summary:
The design of the reward function is critical in reinforcement learning: a poorly designed reward function can cause unstable convergence or even outright failure of the learning algorithm. This thesis takes the reward function in reinforcement learning as its research object and analyzes how reward functions are designed in traditional reinforcement learning methods. Two optimization methods are proposed based on Deep Deterministic Policy Gradient (DDPG): the RFPG algorithm, designed for scenarios with a clear target, and the RD3 reward-decomposition algorithm, designed for general scenarios. Both methods were applied successfully in experiments and avoided the problems caused by reward-function design. The algorithms designed in this thesis are as follows.

The RFPG algorithm is a reward-free reinforcement learning algorithm for scenarios with a clear target. Formulating the reward function is a crucial step in reinforcement learning, but in many systems it is difficult, and different reward functions produce different results. To address this problem, the thesis first designs a state-quantity mechanism to replace the original, complicated reward function. A prediction function is then used to update itself iteratively, so that RFPG can dispense with reward-function design in suitable environments. The algorithm was applied successfully in the Frozen Lake and Cart Pole environments and achieved good performance, demonstrating its effectiveness. However, it has certain applicability conditions, which restrict large-scale use of the algorithm.

The RD3 algorithm is a reinforcement learning algorithm built on a decomposition mechanism for the reward function. By decomposing the correlation between reward units, the algorithm reduces the impact of reward-function design and is suitable for general reinforcement learning scenarios. The single evaluation network of classic reinforcement learning algorithms causes the return value to be erroneously overestimated, so the algorithm converges unstably in complex environments. For the general reinforcement learning setting, RD3 decomposes the reward function into multiple reward-function units to improve exploration performance and network convergence speed. A traffic signal control system is then taken as the research object, and a single intersection with complex parameters is built in the SUMO simulation environment. Finally, in view of the complexity of urban transportation systems, a complex grid-like urban traffic network (Grid Network) is built. Compared with DDPG, RD3 converges faster and more stably, and experimental results show that it effectively reduces vehicle waiting time, improves traffic efficiency, and reduces pollution emissions.

In summary, this thesis proposes the RFPG and RD3 algorithms to address the difficulty and instability of reward-function design. Their effectiveness has been verified in a variety of experimental environments, easing, to a certain extent, the dependence of reinforcement learning on reward-function design.
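The abstract describes RFPG only at a high level. The idea of replacing a hand-designed reward with a goal-referenced "state quantity" plus a self-updating predictor can be sketched roughly as below. Everything here (the distance-based state quantity, the linear next-state predictor, the progress signal) is an illustrative assumption about the mechanism, not the thesis's actual formulation:

```python
import numpy as np

def state_quantity(state, goal):
    """Hypothetical 'state quantity': negative distance to a known target.

    Stands in for a hand-crafted reward when the goal state is explicit
    (e.g. the target tile in Frozen Lake)."""
    return -np.linalg.norm(np.asarray(state, float) - np.asarray(goal, float))

class LinearPredictor:
    """Toy next-state predictor that iteratively refines itself
    from observed transitions (LMS gradient steps)."""
    def __init__(self, dim, lr=0.05):
        self.W = np.eye(dim)
        self.lr = lr

    def predict(self, state):
        return self.W @ np.asarray(state, float)

    def update(self, state, next_state):
        s = np.asarray(state, float)
        err = self.predict(s) - np.asarray(next_state, float)
        self.W -= self.lr * np.outer(err, s)  # gradient step on squared error
        return float(err @ err)

def progress_signal(pred, state, goal):
    """Learning signal without a designed reward: does the predicted
    next state move closer to the goal than the current state?"""
    return state_quantity(pred.predict(state), goal) - state_quantity(state, goal)
```

In this sketch the predictor is trained purely from transitions, and the policy would be driven by `progress_signal` instead of an environment reward, which is one plausible reading of "avoiding the design of the reward function" in clear-target scenarios.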
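The reward-decomposition idea behind RD3 can also be sketched: split one scalar reward into several reward units, each tracked by its own value estimate, with the total return estimate being the sum of the per-unit estimates. The unit definitions, weights, and tabular TD(0) critic below are illustrative assumptions for a traffic-signal setting, not the thesis's DDPG-based implementation:

```python
import numpy as np

def reward_units(waiting_time, queue_len, emissions):
    """Hypothetical decomposition of a traffic-signal reward into units;
    the quantities and weights are illustrative, not the thesis's."""
    return np.array([-0.5 * waiting_time, -0.3 * queue_len, -0.2 * emissions])

class DecomposedCritic:
    """One value estimate per reward unit; the total return estimate
    is the sum over units (tabular TD(0) for illustration, in place of
    the deep critic networks an RD3-style agent would actually use)."""
    def __init__(self, n_states, n_units, gamma=0.99, lr=0.1):
        self.V = np.zeros((n_states, n_units))
        self.gamma, self.lr = gamma, lr

    def update(self, s, units, s_next, done):
        # Per-unit TD(0) target; each unit is bootstrapped separately.
        target = units + (0.0 if done else self.gamma) * self.V[s_next]
        self.V[s] += self.lr * (target - self.V[s])

    def value(self, s):
        return self.V[s].sum()
```

Keeping separate estimates per unit is one way a decomposed critic could temper the overestimation that a single monolithic evaluation network suffers from, since each unit's error is bounded by its own smaller reward scale.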
Keywords/Search Tags: Optimal reward problem, Reinforcement learning, Reward-free, Reward decomposition