With the continuous development of artificial intelligence, deep reinforcement learning (DRL) has received increasing attention from researchers due to its unique advantages. By combining deep learning (DL) and reinforcement learning (RL), DRL not only gives reinforcement learning agents end-to-end learning capabilities in high-dimensional environments, but also makes it possible to further improve model performance in machine learning tasks that lack labeled training samples. Despite great progress, because of the complexity inherited from both DL and RL, DRL still suffers from unstable training, low sample efficiency, difficulty in reproducing results, high hyperparameter sensitivity, and difficulty in escaping from local optima when faced with complex learning tasks such as board games and video games. This thesis proposes DRL approaches for board games based on convolutional neural networks and the Upper Confidence Bounds applied to Trees (UCT) algorithm, and focuses primarily on the problems mentioned above. The work consists of the following three aspects:

(1) To improve the quality of samples in the training process, an effective method for training board-game agents using the search results of the UCT algorithm is proposed. The algorithm uses UCT to re-evaluate the sampling trajectories of the neural network and thereby correct the network's deviations. As the neural network improves, the search space of UCT is effectively reduced, which improves the efficiency of UCT.

(2) Methods that combine a neural network with Monte Carlo tree search (MCTS) not only require a large number of training samples, but also find it difficult to escape misguided search trajectories caused by deviations introduced during training. To solve this problem, a learning algorithm that incorporates bootstrap aggregating (bagging) is proposed. The algorithm makes nearly full use of the training data generated from self-play and 
supports multiple neural networks participating in learning and exploration, which ensures the diversity of the search trajectories, thereby improving the stability of the algorithm and reducing the risk of premature convergence to local optima.

(3) To avoid the performance degradation of the UCT algorithm caused by neural network deviations, and to make full use of all the models trained by the algorithm mentioned above, a UCT algorithm with combined strategies is proposed. The new algorithm not only provides a natural multithreaded extension of UCT, but also improves the accuracy of UCT through an asynchronous search method.

The proposed methods are tested and compared in a series of experiments, and the experimental results confirm their effectiveness.
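As background for the contributions above, the node-selection rule at the heart of UCT can be sketched as follows. This is a generic UCB1-based sketch, not the thesis's implementation; the `Node` class and the exploration constant `c` are illustrative assumptions:

```python
import math

class Node:
    """Minimal UCT tree node (illustrative fields, not the thesis's code)."""
    def __init__(self):
        self.visits = 0       # N(s, a): number of times this node was visited
        self.value_sum = 0.0  # cumulative reward from simulations through this node
        self.children = []    # expanded child nodes

    def ucb1(self, parent_visits, c=1.414):
        """UCB1 score: empirical mean plus an exploration bonus."""
        if self.visits == 0:
            return float("inf")  # unvisited children are explored first
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(parent_visits) / self.visits)
        return exploit + explore

def select_child(node):
    """UCT selection step: descend to the child maximizing the UCB1 score."""
    return max(node.children, key=lambda ch: ch.ucb1(node.visits))
```

In the full algorithm this selection step is applied repeatedly from the root, followed by expansion, simulation, and backpropagation of the result along the visited path.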
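The bootstrap-aggregating idea in contribution (2), training several networks on resampled self-play data and aggregating their policies, can be sketched generically. The uniform-averaging scheme and the `train_fn` stand-in for network training below are assumptions for illustration, not the thesis's exact procedure:

```python
import random

def bootstrap_resample(samples, rng):
    """Draw len(samples) items with replacement: one bagging 'bag'."""
    return [rng.choice(samples) for _ in samples]

def train_ensemble(self_play_samples, train_fn, n_models=3, seed=0):
    """Train n_models learners, each on its own bootstrap resample
    of the self-play data, so each sees a different slice of experience."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_resample(self_play_samples, rng))
            for _ in range(n_models)]

def aggregate_policy(models, state):
    """Combine the ensemble by uniformly averaging each model's
    move-probability distribution for the given state."""
    dists = [m(state) for m in models]
    n_moves = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(n_moves)]
```

Because each model is trained on a different resample, their search trajectories diverge, which is the mechanism the abstract credits for stabilizing training and reducing premature convergence to local optima.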