To optimize intersection signal timing, an online Q-learning model for a single intersection is established on an integrated Excel VBA-Vissim-Matlab simulation platform, with the optimization goal of minimizing the difference between the total critical queue lengths. The online model is divided into a fixed-cycle Q-learning timing model and a variable-cycle Q-learning timing model. Because the performance index takes similar values under adjacent signal timings, the paper proposes a method of constructing the reward function that widens the gap between different actions, improving robustness and computation speed. The reward function takes the difference of the average total critical queue length as its basic unit. A fixed-cycle, two-phase example shows that the Q-learning model behaves correctly: it optimizes dynamically as traffic flow changes and uses accumulated experience to shorten the learning time. Simulation tests under the traffic conditions of Houzishi Bridge show that the model has good practical applicability. A comparison of the fixed-cycle Q-learning timing plan, the variable-cycle Q-learning timing plan, and the TRANSYT timing plan shows that taking the difference between the total critical queue lengths as the optimization target can optimize the time and space resources of the intersection, and that the online Q-learning model has high accuracy, robustness, and learning ability. The paper also discusses the performance of the variable-cycle Q-learning timing model under changing flow conditions.
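
As a rough illustration of the kind of update an online Q-learning timing model relies on, the following minimal sketch shows a tabular one-step Q-learning update with a reward expressed in multiples of a basic unit. Everything here is an assumption made for illustration: the abstract does not give the reward formula, the state and action encoding, or the learning parameters, so the names `reward`, `q_update`, `choose_action`, `unit`, `alpha`, `gamma`, and `epsilon` are hypothetical and not taken from the paper.

```python
import random


def reward(prev_gap, new_gap, unit=1.0):
    """Reward in multiples of a basic unit.

    Assumed form, not the paper's formula: a positive value means the gap
    between critical queue lengths shrank after the chosen timing action.
    """
    return (prev_gap - new_gap) / unit


def q_update(q, state, action, r, next_state, actions, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update on a dict-backed Q table."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (r + gamma * best_next - old)
    return q[(state, action)]


def choose_action(q, state, actions, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Hypothetical usage: actions shift green time between two phases.
actions = [-1, 0, 1]
q_table = {}
r = reward(prev_gap=10.0, new_gap=6.0, unit=2.0)
q_update(q_table, state=0, action=1, r=r, next_state=1, actions=actions)
```

In this sketch, dividing by a smaller `unit` scales up the reward difference between actions whose raw queue gaps are close, which is one plausible way to separate neighbouring timings; the paper's actual construction may differ.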