Li Simin, Xu Ruixiao, Xiu Jingqiao, Zheng Yuwei, Feng Pu, Ma Yuqing, An Bo, Yang Yaodong, Liu Xianglong
IEEE Trans Neural Netw Learn Syst. 2025 Oct;36(10):18118-18132. doi: 10.1109/TNNLS.2025.3577259.
In cooperative multi-agent reinforcement learning (MARL), ensuring robustness against cooperative agents that take unpredictable or worst-case adversarial actions is crucial for real-world deployment. In multi-agent settings, each agent may be perturbed or unperturbed, so the number of potential threat scenarios grows exponentially with the number of agents. Existing robust MARL methods either enumerate or approximate all possible threat scenarios, leading to intensive computation and insufficient robustness. In contrast, humans develop robust behaviors by maintaining a general level of caution rather than preparing for every possible threat. Inspired by human decision making, we frame robust MARL as a control-as-inference problem, in which worst-case robustness across all threat scenarios is implicitly optimized through off-policy evaluation. Specifically, we introduce mutual information regularization as robust regularization (MIR3), which maximizes a lower bound on robustness during routine training, serving as a form of caution for MARL without adversarial inputs. Further analysis shows that MIR3 acts as an information bottleneck, preventing agents from over-reacting to others and aligning policies with robust action priors. In the presence of worst-case adversaries, MIR3 significantly surpasses baseline methods in robustness and training efficiency while maintaining cooperative performance in StarCraft II, quadrotor swarm control, and robot swarm control. When the robot swarm control algorithm is deployed in the real world, our method also outperforms the best baseline by 14.29% in reward. See code and demo videos at https://github.com/DIG-Beihang/MIR3.
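To illustrate the information-bottleneck view described above, the sketch below shows one common way a mutual-information regularizer can be added to a policy loss: penalizing the average KL divergence between each agent's conditional action distribution and a marginal action prior, which upper-bounds (and, with the exact marginal as prior, equals) the mutual information between observations and actions. This is a minimal, hypothetical sketch assuming discrete actions and a tabular policy; the function names (`mi_upper_bound`, `kl`), the coefficient `beta`, and the toy numbers are illustrative, not the paper's implementation.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (assumes full support)."""
    return float(np.sum(p * np.log(p / q)))

def mi_upper_bound(policy_probs, prior):
    """Average KL(pi(.|o) || prior) over observations.

    This is a variational upper bound on I(O; A); it is tight when
    `prior` equals the true marginal action distribution.
    """
    return float(np.mean([kl(p, prior) for p in policy_probs]))

# Toy example: 3 equally likely observations, 2 actions.
policy = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.5]])
prior = policy.mean(axis=0)  # marginal action distribution (robust action prior)

beta = 0.1          # regularization strength (hypothetical)
task_loss = 1.0     # placeholder for the usual MARL policy loss
total_loss = task_loss + beta * mi_upper_bound(policy, prior)
```

In this framing, minimizing the regularized loss discourages the policy from depending too strongly on observations of other agents, which is one way to read the "caution without adversarial inputs" intuition in the abstract.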