Deep reinforcement learning (DRL) combines perceptual capability with autonomous decision-making ability and has been successfully applied to complex scenarios such as Go, video games, and robot control. Applying DRL methods to real-world tasks is a major future research trend. Since most complex scenarios involve multiple interacting entities, multi-agent reinforcement learning (MARL) has emerged, and substantial research has been devoted to MARL methods. In the real world, however, task complexity leaves existing MARL methods facing the challenge of sample efficiency, i.e., a large amount of interaction data is required to reach a satisfactory level of performance. The core of this problem lies in the model training process and the agents' interaction process, and manifests in four challenges: 1) insufficient model robustness makes model performance unstable, hindering continuous and stable policy improvement and thereby reducing the learning efficiency of the policy during training; 2) limited state representation ability prevents the model from adequately learning the features of high-dimensional states, which degrades the approximation of the value function and the update direction of the policy, leading to unsatisfactory sample efficiency during training; 3) lack of decision fairness leads to biased action selection, which limits the diversity of data collected during interaction and the agents' ability to explore complex environments, further reducing learning efficiency; 4) high complexity of the solution space exacerbates the difficulty of environment exploration, which lowers interaction efficiency and makes the policy solving process difficult and inefficient. To this end, this thesis proposes sample efficiency optimization methods for MARL, focusing on optimizing the model training process and the agent interaction process. The main research contents are
as follows:

(1) Bayesian value function-based multi-agent reinforcement learning. Existing MARL methods suffer from insufficient robustness due to the poor generalization and inaccuracy of Q-value estimation, which hinders their application to practical tasks. To address this challenge, this thesis proposes a MARL method based on a Bayesian value function (BMARL). Specifically, a distributional value function module with Bayesian parametric modeling is designed to provide more generalized and accurate estimates of state-action values in MARL, which assists the agents in deciding actions and improves both learning efficiency and the stability of the model training process. In addition, a Gaussian prior is incorporated during training, reducing data dependency and improving training efficiency. Comprehensive experiments are conducted on a multi-agent benchmark environment, i.e., multiple particle system tasks, and the results demonstrate the superiority of the proposed method over existing methods.

(2) Contrastive learning-based multi-agent reinforcement learning. During MARL training, existing methods fail to effectively extract informative features from high-dimensional joint state-action information, which makes it difficult to train well-performing models in a sample-efficient manner. To address this issue, this thesis proposes C2E-MARL, a method that incorporates ensemble learning and contrastive learning. C2E-MARL deploys an ensemble framework with multiple diverse critic networks to comprehensively extract features from the joint state-action vector and provide estimates of the state-action value, which benefits sample efficiency. A contrastive learning-enhanced module is proposed for the encoder network to improve its ability to extract effective features from high-dimensional state vectors, constructing positive samples through an encoder network with a Dropout layer. Experiments are conducted on the multi-agent benchmark
environment, and the results show that C2E-MARL achieves better sample efficiency than existing MARL methods.

(3) Fairness-aware multi-agent reinforcement learning. Existing MARL methods adopt a reward-driven learning paradigm, which leads to biased action decisions. To address this issue, this thesis proposes FELight, a method that takes traffic signal control as its application scenario. FELight combines a fairness-aware reward function, counterfactual data augmentation, and self-supervised learning to balance performance and fairness. The fairness-aware reward function is designed to overcome the unfair cases caused by a purely performance-driven decision-making process. The counterfactual augmentation module improves sample efficiency by learning the relationships behind state transitions through a generative adversarial network and generating additional samples. Self-supervised learning introduces extra training signals for the encoder to enhance its representation learning ability. Substantial experiments are conducted on real traffic signal control datasets, and the results demonstrate that FELight outperforms state-of-the-art methods and provides an effective and fair traffic signal control policy.

(4) Hierarchical decision-based multi-agent reinforcement learning. Existing MARL methods solve tasks directly in the original solution space, making it difficult to effectively learn the desired model for tasks with high solution space complexity. To address this challenge, this thesis proposes HRL4EC, a method that takes multi-mode epidemic control as its environment. HRL4EC decouples the original actions into multi-level actions and achieves cooperative control. A multi-mode epidemic control simulation environment is constructed to simulate the real-world process of epidemic control, and HRL4EC is used to decide the implementation strategy of multi-mode epidemic interventions. Experiments are
performed on a dataset of epidemic outbreaks, and the experimental results show that HRL4EC provides a more effective intervention deployment strategy for controlling the spread of the epidemic than current methods. Meanwhile, HRL4EC offers heuristic suggestions to epidemic prevention departments for developing prevention and control interventions.

In summary, taking the optimization of MARL sample efficiency in complex scenarios and the promotion of its practical application as the starting point, this thesis focuses on optimizing the model training process and the agents' interaction process, respectively. A series of studies on model robustness, state representation, decision fairness, and solution space complexity is carried out, with the aim of contributing at the theoretical level and providing application value in practice.
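To make the Gaussian-prior idea behind the Bayesian value function in contribution (1) concrete, the following is a minimal illustrative sketch, not the thesis's BMARL algorithm, and all class and method names are hypothetical. It maintains a conjugate Gaussian posterior over a scalar Q-value: with few observed returns, the estimate shrinks toward the prior mean (reducing data dependency), and as returns accumulate it converges to the sample mean.

```python
import math

class GaussianQEstimate:
    """Conjugate Gaussian posterior over a scalar Q-value (illustrative only).

    Prior: Q ~ N(mu0, sigma0^2); each observed return is modeled as
    N(Q, obs_sigma^2) with known observation noise.
    """

    def __init__(self, mu0=0.0, sigma0=1.0, obs_sigma=1.0):
        self.mu = mu0                              # posterior mean
        self.precision = 1.0 / sigma0 ** 2         # posterior precision
        self.obs_precision = 1.0 / obs_sigma ** 2  # likelihood precision

    def update(self, ret):
        # Standard conjugate update: precisions add, and the new mean is a
        # precision-weighted average of the prior mean and the observation.
        new_precision = self.precision + self.obs_precision
        self.mu = (self.precision * self.mu + self.obs_precision * ret) / new_precision
        self.precision = new_precision

    @property
    def std(self):
        # Posterior uncertainty shrinks monotonically as returns arrive.
        return 1.0 / math.sqrt(self.precision)
```

For example, starting from the prior N(0, 1) with unit observation noise, a single observed return of 2.0 moves the posterior mean only to 1.0 rather than 2.0, illustrating the regularizing effect of the prior when interaction data are scarce.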