Chen Zihan, Luo Biao, Hu Tianmeng, Xu Xiaodong
School of Automation, Central South University, Changsha 410083, China.
Neural Netw. 2023 Oct;167:450-459. doi: 10.1016/j.neunet.2023.08.016. Epub 2023 Aug 22.
Effective exploration is key to achieving high returns in reinforcement learning. In multi-agent systems, agents must explore jointly to find the optimal joint policy. Due to the exploration problem and the shared reward, policy-based multi-agent reinforcement learning (MARL) algorithms suffer from policy overfitting, which may cause the joint policy to fall into a local optimum. This paper introduces a novel general framework called Learning Joint-Action Intrinsic Reward (LJIR) for improving the joint exploration ability and performance of multi-agent reinforcement learners. LJIR observes the agents' state and joint actions and learns online to construct an intrinsic reward that guides effective joint exploration. Through a novel combination of a Transformer and random network distillation, LJIR identifies novel states and assigns them larger intrinsic rewards, which helps agents find the best joint actions. LJIR dynamically adjusts the weight of exploration versus exploitation during training and ultimately preserves policy invariance. To ensure LJIR integrates seamlessly with existing MARL algorithms, we also provide a flexible method for combining intrinsic and external rewards. Empirical results on the SMAC benchmark show that the proposed method achieves state-of-the-art performance on challenging tasks.
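To make the novelty-bonus idea concrete, the sketch below shows a generic random-network-distillation (RND) style intrinsic reward computed over the global state and joint action, plus a simple mixing of external and intrinsic rewards. This is a minimal illustration of the general technique the abstract names, not the paper's actual architecture: the network sizes, the class name JointActionRND, and the mixing coefficient beta are all illustrative assumptions, and the paper's own combination rule additionally anneals exploration during training.

```python
# Minimal RND-style intrinsic reward over (state, joint action).
# All layer sizes, names, and `beta` are illustrative assumptions.
import torch
import torch.nn as nn


class JointActionRND(nn.Module):
    def __init__(self, state_dim, joint_action_dim, embed_dim=128):
        super().__init__()
        in_dim = state_dim + joint_action_dim
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network trained to match the target's output;
        # its prediction error serves as the novelty signal.
        self.predictor = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def intrinsic_reward(self, state, joint_action):
        """Prediction error is large for novel (state, joint action) pairs."""
        x = torch.cat([state, joint_action], dim=-1)
        with torch.no_grad():
            target_feat = self.target(x)
        pred_feat = self.predictor(x)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)


def mixed_reward(r_ext, r_int, beta=0.1):
    """Illustrative linear mixing of external and intrinsic rewards."""
    return r_ext + beta * r_int.detach()
```

In practice, the predictor is trained by minimizing the same squared error used as the intrinsic reward, so frequently visited state-action pairs yield a shrinking bonus while rarely visited ones keep attracting joint exploration.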