| As artificial intelligence technology flourishes, machine learning methods based on reinforcement learning are increasingly applied in fields such as autonomous driving, robot control, and intelligent assistants. Since most real-world scenarios are only partially observable, agents must make decisions using local observations alone. Deep reinforcement learning is often used to seek optimal coordination in partially observable scenarios because of its excellent performance on sequential decision-making problems. However, the uncertainty of the environment makes it difficult for agents to find a universal cooperative strategy, and coordination between agents trained in different batches is often poor. This paper adopts the Hanabi game as a benchmark for partially observable coordination and constructs a reinforcement learning environment for it. Unspecialized-Belief Learning (UBL) and Symmetric Domain Adaptation (SDA) are proposed to address the problem of coordination under partial observation.

Firstly, an important and challenging problem in partially observable coordination is zero-shot coordination (ZSC), where agents are required to cooperate with different partners without prior agreement. Previous deep reinforcement learning algorithms seek strategies that enable agents trained together to cooperate with one another, but these agents often perform poorly when paired with other agents or human players. Some studies have pointed out that this failure stems from the prior assumptions agents form during training, which we call specialized beliefs. We construct a model that describes the formation of specialized beliefs and, based on this model, propose UBL, a method that controls belief specialization by avoiding sample imbalance and sample scarcity. UBL significantly reduces specialized beliefs and achieves a score of 24.03/25 on the Hanabi benchmark.

Secondly, building on zero-shot coordination, we explore the possibility of few-shot coordination (FSC), where agents are required to cooperate with new partners after several rounds of interaction, still without prior agreement. Current work focuses mainly on zero-shot coordination, yet most coordination problems in the real world are resolved through particular conventions (such as whether vehicles should drive on the left or the right side of the road). In few-shot coordination, AI agents are expected to understand the intentions and specialized beliefs of other agents and achieve coordination after only a few interaction steps. This paper proposes SDA, which dynamically matches adaptive strategies to the symmetric environment features shared between agents, and explores increasing the number of samples through data augmentation. Experiments show that SDA achieves 16.70/25 in the few-shot coordination setting of Hanabi.

In summary, we construct the Hanabi reinforcement learning benchmark environment, propose the UBL algorithm to reduce specialized beliefs in the ZSC setting, explore coordination between agents with symmetric strategies, and design data augmentation methods to address the problem of sparse samples in the FSC setting. This paper provides insights into the application of reinforcement learning to partially observable coordination scenarios.
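As a rough illustration of the symmetry-based data augmentation idea mentioned above (not the paper's actual implementation; the observation encoding and helper names are assumptions for this sketch), Hanabi is invariant under permutations of its five colors, so a single logged observation can be relabeled into several equivalent training samples:

```python
import itertools
import numpy as np

# Hypothetical sketch: Hanabi's five colors are interchangeable, so relabeling
# colors in an observation yields up to 5! = 120 symmetry-equivalent samples.
NUM_COLORS, NUM_RANKS = 5, 5

def permute_colors(obs, perm):
    """Relabel the color dimension of a (colors, ranks) observation array."""
    return obs[list(perm), :]

def augment(obs, max_copies=8):
    """Generate symmetry-equivalent copies of one observation."""
    perms = itertools.permutations(range(NUM_COLORS))
    return [permute_colors(obs, p) for p in itertools.islice(perms, max_copies)]

if __name__ == "__main__":
    # Toy observation: counts of each (color, rank) card visible to the agent.
    obs = np.random.randint(0, 3, size=(NUM_COLORS, NUM_RANKS))
    copies = augment(obs)
    print(len(copies), copies[0].shape)  # 8 (5, 5)
```

Any real use of this idea would have to apply the same color permutation consistently to actions, hints, and the rest of the trajectory, not just to a single observation array.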