For a long time, reinforcement learning algorithms could not directly process raw sensory data from the environment when solving control and decision problems that require interaction with the environment, so their range of application was very limited, and the known successful applications depended heavily on hand-crafted features. The development of deep learning in recent years has pushed artificial intelligence research to a new peak. One of its important results is that deep neural networks can automatically extract features from high-dimensional images, sometimes even better than manually designed features. Therefore, deep reinforcement learning, formed by incorporating deep learning techniques into reinforcement learning algorithms, has gradually become a new research direction in the field of reinforcement learning. However, training deep neural networks on samples generated by a reinforcement learning agent requires eliminating the high correlation between successive samples. Existing methods are mostly based on the experience replay of a single agent, in which relatively independent training samples are obtained by randomly sampling historical experience data; however, experience replay requires a large amount of memory and limits the iteration speed of the network.

This paper uses multiple agents executing in parallel to generate mutually independent training samples, which are then mixed and used to train the network, in order to solve control and decision problems in complex game environments. The specific work is as follows. First, a set of pre-processing procedures for the game environment is designed to facilitate network training while reducing the scale of computation. Second, the policy-gradient-based Actor-Critic algorithm is improved by combining it with the multi-step TD method to reduce the bias in the estimated return. Third, a deep convolutional neural network structure is designed to approximate the value function and policy function of the algorithm and to perform feature extraction in various complex game environments. Finally, a parallel implementation framework based on multiple producers and a single consumer is designed: through the cooperation of the producers, composed of multiple agents and prediction threads, with the training thread acting as the consumer, the correlation between training samples is eliminated and the efficiency of network training is improved.

Experiments show that training the value network and policy network on mixed samples generated by multiple agents does eliminate the correlation between samples and stably yields the optimal policy. In the five game environments tested in this paper, the performance of the algorithm exceeds the level of human players. Moreover, compared with the deep Q-learning algorithm based on experience replay and the GA3C algorithm based on multi-agent parallelism, this method achieves a clear improvement in both training speed and final performance.
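For context, the multi-step TD Actor-Critic idea mentioned above is commonly written in the following form; this is a standard textbook formulation rather than the exact variant used in this work, and the symbols (the n-step horizon n, value network V_theta, policy network pi_omega, entropy weight beta) are introduced here only for illustration:

R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\theta(s_{t+n}),
A_t = R_t^{(n)} - V_\theta(s_t),
\Delta\omega \propto \nabla_\omega \log \pi_\omega(a_t \mid s_t)\, A_t + \beta\, \nabla_\omega H\big(\pi_\omega(\cdot \mid s_t)\big),
\Delta\theta \propto -\nabla_\theta \big(R_t^{(n)} - V_\theta(s_t)\big)^2.

Bootstrapping from V_theta after n real rewards, rather than after a single step, is what reduces the bias of the estimated return; the entropy term H is the usual exploration bonus used in this family of algorithms.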
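The multi-producer / single-consumer arrangement described above can be sketched in a few lines of Python. This is a minimal illustration of the queue-based decoupling between acting agents and the training thread, not the implementation from this work; all names (actor_worker, trainer, NUM_AGENTS, etc.) and the random stand-ins for the environment and networks are assumptions made for the example.

import queue
import threading
import random

NUM_AGENTS = 8          # number of parallel actor (producer) threads
BATCH_SIZE = 32         # samples consumed per training step
sample_queue = queue.Queue(maxsize=1024)   # shared buffer between producers and consumer
stop_event = threading.Event()

def actor_worker(agent_id):
    """Producer: one agent interacting with its own copy of the environment."""
    rng = random.Random(agent_id)
    state = rng.random()                    # stand-in for an environment observation
    while not stop_event.is_set():
        action = rng.randrange(4)           # stand-in for a policy-network prediction
        reward = rng.random()               # stand-in for an environment step
        next_state = rng.random()
        # Each agent pushes its own transitions; mixing across agents
        # happens naturally in the shared queue.
        sample_queue.put((state, action, reward, next_state))
        state = next_state

def trainer():
    """Consumer: the single training thread that mixes samples from all agents."""
    for step in range(100):
        batch = [sample_queue.get() for _ in range(BATCH_SIZE)]
        # A real implementation would compute n-step returns and apply the
        # actor-critic gradient updates here; this sketch only reports progress.
        mean_reward = sum(r for _, _, r, _ in batch) / BATCH_SIZE
        if step % 20 == 0:
            print(f"step {step}: mean reward in batch = {mean_reward:.3f}")
    stop_event.set()

producers = [threading.Thread(target=actor_worker, args=(i,), daemon=True)
             for i in range(NUM_AGENTS)]
for p in producers:
    p.start()
trainer()

Because consecutive items in the batch come from different agents in different environment states, the consumer sees far less temporal correlation than it would from a single agent, which is the same effect experience replay achieves but without the large replay memory.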