In the real world, there are many scenarios that involve game-theoretic decision-making, such as commercial contract negotiations, tactical planning in sports, and defensive confrontations in the military domain. With the rapid development of artificial intelligence and the practical application of game theory, a large number of models and algorithms have emerged, playing a crucial auxiliary and guiding role in various game decision-making scenarios. However, actual game decision-making scenarios contain many uncertain and unknowable factors, so studying game decision-making problems with incomplete information has important application and reference value for real-world decision support and analysis systems. Nash equilibrium is a typical equilibrium solution in incomplete-information games, and the counterfactual regret minimization (CFR) algorithm is currently one of the most representative methods for finding a Nash equilibrium, owing to its good theoretical guarantees and strong empirical performance. However, the existing CFR algorithm and its variants each have their own advantages and disadvantages in incomplete-information extensive-form games, and their generalization performance is weak. In addition, even when the Nash equilibrium policy is known, opponent modeling methods may still obtain higher returns by exploiting the opponent's weaknesses, which offers another way to solve incomplete-information games; however, the assumptions that the opponent is known and stationary limit the practicality and accuracy of the opponent model. In response to the above issues, the specific work of this dissertation is as follows:

(1) For the problem of optimizing counterfactual regret minimization under incomplete information, this dissertation is the first to combine the proximal policy optimization (PPO) algorithm from reinforcement learning with the vanilla counterfactual regret minimization algorithm, and proposes the PPO-CFR algorithm. Specifically, by training a rational agent with a reward function constructed from the exploitability of the CFR algorithm, the PPO agent can select the most appropriate regret-calculation and strategy-update type at each timestep, so as to generate an efficient CFR iterative strategy and improve the generalization performance of existing CFR algorithms, thereby achieving strategy optimization for incomplete-information extensive-form games. In addition, this dissertation provides a proof of convergence of the PPO-CFR algorithm. Moreover, since PPO can be applied to both discrete and continuous action spaces, it is broadly applicable to the selection of regret-update types.
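The interaction described above can be pictured as a small control loop in which the PPO agent chooses, at every CFR iteration, which update rule to apply and is rewarded by the resulting drop in exploitability. The Python sketch below only illustrates that loop under assumed placeholder names; `CFRSolver`, `ppo_select_variant`, and the list of update variants are hypothetical stand-ins, not the dissertation's actual implementation.

```python
# Illustrative sketch of a PPO-driven CFR loop (all names are placeholders).
from dataclasses import dataclass
import random

UPDATE_VARIANTS = ["vanilla", "plus", "linear"]  # candidate regret/strategy-update rules

@dataclass
class CFRSolver:
    """Placeholder tabular CFR solver; a real one would track regrets per information set."""
    exploitability: float = 1.0

    def iterate(self, variant: str) -> None:
        # Stub: pretend each variant shrinks exploitability at a slightly different rate.
        rate = {"vanilla": 0.99, "plus": 0.97, "linear": 0.98}[variant]
        self.exploitability *= rate + random.uniform(-0.005, 0.005)

def ppo_select_variant(observation: float) -> str:
    """Stand-in for the PPO policy; a real agent would map the CFR state
    (iteration count, recent regrets, current exploitability) to an action."""
    return random.choice(UPDATE_VARIANTS)

solver = CFRSolver()
for t in range(100):
    prev = solver.exploitability
    action = ppo_select_variant(prev)       # PPO picks the update type for this timestep
    solver.iterate(action)                  # run one CFR iteration with that rule
    reward = prev - solver.exploitability   # reward built from the drop in exploitability
    # a real PPO agent would store (observation, action, reward) here and
    # periodically update its policy with the clipped PPO objective
print(f"final exploitability ~ {solver.exploitability:.4f}")
```

The reward shown here is simply the per-iteration reduction in exploitability, one plausible reading of the exploitability-based reward the abstract describes.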
(2) For the problem of opponent modeling in online play against unknown and non-stationary opponents under incomplete information, this dissertation proposes a new adaptive opponent modeling method combined with the Exp3.P algorithm for adversarial multi-armed bandits, namely the AOM-Exp3.P method. The method consists of two parts: offline training before playing against opponents, and online play against opponents. In the offline training part, the opponents' policy embeddings are first learned with a conditional variational autoencoder, and conditional reinforcement learning is then used to train approximate best responses to the potential opponents. In the online testing part, the dissertation converts the problem into an adversarial two-armed bandit, where one arm is a fixed policy with the best worst-case performance and the other arm is a real-time greedy policy obtained by combining the online-inferred opponent embedding with the offline-trained best response. The Exp3.P algorithm is used to realize an effective choice between the fixed policy and the greedy policy, balancing exploration and exploitation in each round of the online game and theoretically achieving, with high probability, lower regret than either of the two policies alone.

Experimental verification shows that the PPO-CFR algorithm has better generalization performance and lower exploitability than existing tabular CFR algorithms in typical poker games. The exploitability of PPO-CFR improves by at least 16.08%, 18.16%, and 13.17% in Kuhn poker, Leduc poker, and Royal poker, respectively, which also indicates that its iterative strategy is closer to a Nash equilibrium. In addition, the AOM-Exp3.P method has been experimentally evaluated on Kuhn poker and a grid-world predator-prey game, where it exhibits stronger performance against unknown and non-stationary opponents. Against adaptive opponent types, the AOM-Exp3.P method is the least exploitable. Meanwhile, the AOM-Exp3.P method performs best among state-of-the-art methods in terms of average and worst-case performance.
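To make the online selection step concrete, the following is a minimal sketch of the standard Exp3.P update (Auer et al.) applied to the two arms described above, a safe fixed policy versus a greedy exploiting policy. The payoff function `play_round` and the specific parameter values are placeholders for illustration, not the dissertation's implementation.

```python
# Minimal Exp3.P sketch for choosing between a fixed safe policy and a
# greedy exploiting policy each round; play_round is a placeholder payoff.
import math
import random

ARMS = ["safe_fixed_policy", "greedy_exploiting_policy"]
K, T = len(ARMS), 1000
delta = 0.05                                           # confidence level for the high-probability bound
alpha = 2.0 * math.sqrt(math.log(K * T / delta))       # Exp3.P exploration-bonus scale
gamma = min(0.6, 2.0 * math.sqrt(0.6 * K * math.log(K) / T))
weights = [math.exp(alpha * gamma / 3.0 * math.sqrt(T / K))] * K

def play_round(arm: str) -> float:
    """Placeholder payoff in [0, 1]; a real system would play one game round
    with either the safe policy or the greedy response conditioned on the
    inferred opponent embedding, and return the normalized return."""
    return random.random()

for t in range(T):
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    i = random.choices(range(K), weights=probs)[0]      # sample which policy to play
    reward = play_round(ARMS[i])
    for j in range(K):
        x_hat = reward / probs[j] if j == i else 0.0    # importance-weighted reward estimate
        bonus = alpha / (probs[j] * math.sqrt(K * T))   # high-probability exploration bonus
        weights[j] *= math.exp(gamma / (3 * K) * (x_hat + bonus))
```

With only two arms, the loop simply keeps playing whichever of the safe and greedy policies has been paying off, while the gamma-mixing and the bonus term preserve enough exploration to recover if the opponent changes behavior mid-match.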