Jin Weiqiang, Tian Xingwu, Wang Ningwei, Wu Baohai, Shi Bohang, Zhao Biao, Yang Guang
School of Information and Communications Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
School of Information and Communications Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Artificial Intelligence Institute of iFLYTEK Research, Hefei, Anhui 230088, China.
Neural Netw. 2025 Jul 15;192:107875. doi: 10.1016/j.neunet.2025.107875.
Multi-agent reinforcement learning (MARL) plays a pivotal role in solving complex decision-making problems wherein multiple agents interact in a shared environment. However, mainstream MARL algorithms still suffer from the following challenges: 1) the policies of agents tend to converge and stabilise during learning, which leads to insufficient exploration and sub-optimal strategies, particularly in environments with extremely large state, observation and action spaces; and 2) the sampling inefficiency of MARL results in inadequate learning from the experience replay buffer, requiring a massive number of environmental interactions. To address these issues, we propose a novel MARL approach for various multi-agent decision-making tasks, namely efficient eXploration Joint with Training Unbiased for MARL (eXJTU-MARL), to fully enhance exploration efficiency during environmental interactions and the efficiency of trajectory learning from the experience replay buffer. To achieve this, we introduce two core modules in eXJTU-MARL: adaptive policy resetting and state-representation-based balanced experience sampling. Specifically, for the first time, we introduce a state-representation-based sampling strategy that enhances data efficiency by improving the quality of experience replay samples in MARL. Accordingly, eXJTU-MARL effectively enhances sample efficiency, prevents agents from prematurely converging to sub-optimal policies and facilitates sufficient exploration of the state-action space. Extensive experiments in the StarCraft Multi-Agent Challenge environment demonstrate that eXJTU-MARL consistently outperforms mainstream MARL baselines, highlighting the effectiveness of adaptive policy resetting and balanced experience sampling in enhancing the overall exploration capabilities and learning efficiency of MARL models in complex multi-agent environments. The code is available at GitHub: https://github.com/albert-jin/eXJTU-MARL.
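The abstract's two core modules can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the reset fraction, and the distance-based sampling weights are all assumptions chosen to convey the idea that 1) periodically re-initialising part of a policy's parameters restores plasticity and counters premature convergence, and 2) weighting replay samples by how far their state embeddings lie from the buffer mean favours under-represented states.

```python
import numpy as np

def adaptive_reset(params, reset_fraction=0.2, rng=None):
    """Hypothetical sketch of adaptive policy resetting: re-initialise a
    random fraction of the policy parameters so the agent keeps exploring
    instead of settling early into a sub-optimal policy."""
    rng = rng or np.random.default_rng(0)
    params = params.copy()
    n_reset = max(1, int(len(params) * reset_fraction))
    idx = rng.choice(len(params), size=n_reset, replace=False)
    params[idx] = rng.normal(0.0, 0.1, size=n_reset)  # fresh random weights
    return params

def balanced_sample(state_embeddings, batch_size, rng=None):
    """Hypothetical sketch of state-representation-based balanced sampling:
    transitions whose state embeddings are far from the buffer mean are
    replayed with higher probability, increasing batch diversity."""
    rng = rng or np.random.default_rng(0)
    emb = np.asarray(state_embeddings, dtype=float)
    dist = np.linalg.norm(emb - emb.mean(axis=0), axis=1)
    probs = (dist + 1e-8) / (dist + 1e-8).sum()  # normalise to a distribution
    return rng.choice(len(emb), size=batch_size, replace=False, p=probs)
```

Under this sketch, rare states (here, the two rows of fives among many zero rows) receive most of the sampling mass, which is the intuition behind improving replay-sample quality.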