| In recent years,as the problem of perfect information games like Go has been continuously overcome,people have turned their attention to more complex games of imperfect information like poker.The extended game of imperfect information has very complete mathematical properties and is the basis of many scenes in life.Good modeling is closely related to other research methods in the field of artificial intelligence,such as tree search,and is known as the "philosopher" in the field of artificial intelligence.The collaborative development of algorithms has great reference value.In the imperfect information game,the two-person zero-sum game is the most representative.The optimal solution of the two-person zero-sum game exists and is unique,which can be obtained by solving the Nash equilibrium.In the method of calculating the two-person zero-sum game Nash equilibrium,the online convex The optimization algorithm has theoretical superiority far beyond other algorithms.However,the online convex optimization algorithm(OCO)still faces great challenges in solving extended game problems,which are mainly reflected in the following two aspects:First,the theoretical performance has not been fully released in engineering practice,and the real The convergence speed is obviously lagging behind;second,the applicable game scale is relatively small,and the solution efficiency of directly applying to solve large-scale game problems is not high.This paper focuses on the aforementioned challenges faced by online convex optimization algorithms in solving two-player zero-sum game problems.Aiming at the problem of slow convergence in practice,this paper proposes Adaptive Optimistic Online Mirror Descent(Ada-OOMD).By studying the iterative process of the virtual regret minimization algorithm(CFR),it can be found that its faster convergence speed is mainly due to the adaptive adjustment of the accumulated regret value in the algorithm update.Based on this discovery,this paper attempts to introduce adaptive features into the optimistic online mirror descent algorithm(OOMD),and obtains the Ada-OOMD algorithm.In the improved algorithm,the regularization function of policy update is adjusted from a fixed value to a time-varying value.This paper not only gives the specific expression of adaptive parameter adjustment,but also gives the proof of the convergence of this adaptive improvement at the theoretical level.At the same time,the performance of the algorithm is tested on a variety of small and medium-sized game games.The experimental results show that compared with the OOMD algorithm,the real convergence speed of the Ada-OOMD algorithm has been increased by at least two orders of magnitude.In some games,it is close to the convergence speed of the current optimal algorithm CFR+.In other test games,it even exceeds the CFR+algorithm.Aiming at the low efficiency of solving large-scale game problems,this paper combines the neural network with the Ada-OOMD algorithm and proposes the Deep-AdaOOMD algorithm.The new algorithm uses Monte Carlo importance sampling to obtain training samples,which avoids traversing all nodes and reduces computing requirements and storage space.Secondly,the neural network is used to fit the virtual value and game strategy of players at each game node.On the one hand,there is no need to maintain the value table of accumulated regrets of all nodes,which further reduces the storage space;on the other hand,using the powerful generalization ability of the neural network,Correlation metrics for unsampled nodes can be fitted.This paper verifies the feasibility of the algorithm on the large-scale poker game FHP,and compares it with the current representative algorithm.The experimental results show that,Deep-Ada-OOMD algorithm has faster convergence speed and higher sampling efficiency. |