Future Convergence Engineering, Department of Computer Science and Engineering, Korea University of Technology and Education, Cheonan, 31253, Republic of Korea.
Neural Netw. 2024 Apr;172:106149. doi: 10.1016/j.neunet.2024.106149. Epub 2024 Jan 26.
In this study, a novel exploration method for centralized training and decentralized execution (CTDE)-based multi-agent reinforcement learning (MARL) is introduced. The method uses the concept of strangeness, which is determined by evaluating (1) how unfamiliar the observations an agent encounters are and (2) how unfamiliar the entire state the agents visit is. An exploration bonus derived from this strangeness is combined with the extrinsic reward obtained from the environment to form a mixed reward, which is then used to train CTDE-based MARL algorithms. Additionally, a separate action-value function is proposed to prevent a high exploration bonus from overwhelming the sensitivity to extrinsic rewards during MARL training; this separate function is used to design the behavioral policy that generates transitions. The proposed method is largely unaffected by the stochastic transitions commonly observed in MARL tasks and improves the stability of CTDE-based MARL algorithms when they are combined with an exploration method. Through didactic examples and a demonstration of the substantial performance improvement our proposed exploration method brings to CTDE-based MARL algorithms, we illustrate the advantages of our approach. These evaluations show that our method outperforms state-of-the-art MARL baselines on challenging tasks within the StarCraft II micromanagement benchmark, underscoring its effectiveness in improving MARL.
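The core mechanism described above can be sketched in a simplified tabular form. This is an illustrative sketch only, not the paper's implementation: a behavioral action-value estimate trained on the mixed reward drives action selection, while a separate action-value estimate trained on the extrinsic reward alone preserves sensitivity to the task reward. The `strangeness` measure here is a hypothetical stand-in based on inverse visit counts; the paper instead derives it from the unfamiliarity of agent observations and of the entire state.

```python
import random
from collections import defaultdict


class StrangenessExplorer:
    """Tabular sketch of strangeness-driven exploration.

    Two action-value tables are kept: `q_beh` is trained on the mixed
    reward (extrinsic + strangeness bonus) and selects actions, while
    `q_ext` is trained on the extrinsic reward alone so that a high
    exploration bonus does not overwhelm the task signal.
    """

    def __init__(self, actions, beta=0.1, alpha=0.5, gamma=0.99):
        self.actions = actions
        self.beta, self.alpha, self.gamma = beta, alpha, gamma
        self.q_ext = defaultdict(float)   # action values from extrinsic reward
        self.q_beh = defaultdict(float)   # action values from mixed reward
        self.visits = defaultdict(int)    # state visit counts (novelty proxy)

    def strangeness(self, state):
        # Higher for rarely visited states; decays toward 0 as counts grow.
        return 1.0 / (1 + self.visits[state])

    def act(self, state, epsilon=0.1):
        # Behavioral policy: epsilon-greedy over the mixed-reward values.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_beh[(state, a)])

    def update(self, s, a, r_ext, s_next, done=False):
        self.visits[s] += 1
        # Mixed reward: extrinsic reward plus a scaled exploration bonus.
        r_mix = r_ext + self.beta * self.strangeness(s_next)
        # One Q-learning step per table, each with its own reward signal.
        for q, r in ((self.q_ext, r_ext), (self.q_beh, r_mix)):
            nxt = 0.0 if done else max(q[(s_next, b)] for b in self.actions)
            q[(s, a)] += self.alpha * (r + self.gamma * nxt - q[(s, a)])
```

The design choice mirrored here is the separation of concerns: exploration pressure lives only in the behavioral values, so the extrinsic-value estimates remain an undistorted readout of task performance.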