| Multi-agent reinforcement learning usually needs to estimate joint actions from the global information of all agents in order to make decisions. For large-scale tasks, however, the overhead of estimating joint actions grows exponentially with the number of agents. In such cases, decentralized training offers better utility and scalability because it avoids complex joint-action estimation, but exploring efficiently enough to obtain an optimal policy remains challenging in complex multi-agent scenarios. The challenge has two roots: 1) An optimal policy requires sufficient exploration of useful states, which can be extremely costly in a large state space. 2) Decentralized training allows agents to share knowledge in order to improve sample efficiency and cut exploration cost in large-scale spaces. Nevertheless, when agents are trained in parallel, the knowledge they share is likely to be sub-optimal because of insufficient exploration, which discourages other agents from actively searching for better states and thus results in sub-optimal policies. Balancing the trade-off between reducing exploration cost and improving policy performance is therefore the crux of effective decentralized training and is of great significance for complex multi-agent tasks. This paper studies knowledge sharing and exploration mechanisms in multi-agent reinforcement learning. The main contributions are as follows: (1) To balance exploration cost and policy performance, this paper introduces the concept of cautiously-optimistic exploration, which asks agents to cautiously avoid useless states while optimistically exploring unknown areas, so that agents can learn better policies at lower cost. (2) This paper proposes a novel knowledge sharing framework based on an advising mechanism to realize cautiously-optimistic exploration. The key idea is to use the knowledge generated from failure experience to narrow down the exploration space. The framework analyzes how to represent knowledge of success and failure in past exploration, and trains agents to modify their policies based on their own knowledge and that of others. Experimental results show that the framework significantly surpasses existing work in both policy convergence and final performance on complex tasks. (3) To address the potential issue of one-sidedness, this paper further improves the above knowledge sharing framework. The improved framework comprehensively considers advice from multiple agents rather than from one randomly selected agent, so as to reduce the impact of any single sub-optimal piece of advice on the policy. Furthermore, the new framework includes a filter that removes advice from less experienced agents to ensure positive transfer. Experimental results show that the improved framework still achieves superior results, with faster convergence and better policy performance. |
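
The ideas summarized above (filtering advice by advisor experience, combining advice from several agents, avoiding actions known to fail, and exploring optimistically elsewhere) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the majority vote, the visit-count threshold `min_experience`, the count-based exploration bonus, and the function names `aggregate_advice` and `choose_action` are all assumptions chosen only to make the concepts concrete.

```python
import math
from collections import Counter


def aggregate_advice(advice, experience, min_experience):
    """Combine advice from several advisors (hypothetical interface).

    advice:         dict advisor_id -> suggested action
    experience:     dict advisor_id -> visit count for the current state
    min_experience: assumed threshold; advice from advisors with fewer
                    visits is filtered out
    Returns the majority action among trusted advisors, or None if no
    advisor passes the filter.
    """
    trusted = [action for agent, action in advice.items()
               if experience.get(agent, 0) >= min_experience]
    if not trusted:
        return None
    counts = Counter(trusted)
    top = max(counts.values())
    # Assumption: ties are broken deterministically by the smallest action.
    return min(a for a, c in counts.items() if c == top)


def choose_action(q_values, visits, failed, advised=None, bonus=1.0):
    """Pick an action cautiously (skip failures) and optimistically (bonus).

    q_values, visits: lists indexed by action
    failed:           set of actions reported as failures (own or shared)
    advised:          optional action returned by aggregate_advice
    bonus:            assumed weight of the count-based exploration bonus
    """
    # Cautious part: remove actions that failure knowledge rules out.
    candidates = [a for a in range(len(q_values)) if a not in failed]
    if not candidates:
        # Every action is marked as failed: fall back to all actions.
        candidates = list(range(len(q_values)))
    # Follow aggregated advice only if it is not a known failure.
    if advised in candidates:
        return advised
    # Optimistic part: rarely visited actions receive a larger bonus.
    return max(candidates,
               key=lambda a: q_values[a] + bonus / math.sqrt(1 + visits[a]))


advice = {"a1": 2, "a2": 2, "a3": 0, "a4": 0, "a5": 0}
experience = {"a1": 10, "a2": 8, "a3": 50, "a4": 1, "a5": 2}
advised = aggregate_advice(advice, experience, min_experience=5)
action = choose_action([0.5, 0.2, 0.1], [100, 0, 0], failed={1}, advised=None)
```

In this toy input, an unfiltered majority vote would pick action 0, but two of its three supporters fall below the experience threshold, so the filtered vote yields action 2. With no advice available, `choose_action` skips the failed action 1 and prefers the unvisited action 2 over the heavily visited action 0.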