

SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multiagent Reinforcement Learning.

Author Information

Yao Xinghu, Wen Chao, Wang Yuhui, Tan Xiaoyang

Publication Information

IEEE Trans Neural Netw Learn Syst. 2023 Jan;34(1):52-63. doi: 10.1109/TNNLS.2021.3089493. Epub 2023 Jan 5.

Abstract

Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multiagent reinforcement learning (MARL), as it has to deal with the fact that the joint action space grows exponentially with the number of agents. This article proposes an approach, named SMIX(λ), that achieves this through off-policy training, avoiding the greedy assumption commonly made in CVF learning. As importance sampling for such off-policy training is both computationally costly and numerically unstable, we propose to use the λ-return as a proxy to compute the temporal difference (TD) error. With this new loss objective, we adopt a modified QMIX network structure as the base to train our model. By further connecting it with the Q(λ) approach from a unified expectation-correction viewpoint, we show that the proposed SMIX(λ) is equivalent to Q(λ) and hence shares its convergence properties, while not suffering from the aforementioned curse-of-dimensionality problem inherent in MARL. Experiments on the StarCraft Multiagent Challenge (SMAC) benchmark demonstrate that our approach not only outperforms several state-of-the-art MARL methods by a large margin but can also serve as a general tool to improve the overall performance of other centralized training with decentralized execution (CTDE)-type algorithms by enhancing their CVFs.
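The λ-return target mentioned above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: the function name lambda_return_targets and its arguments are hypothetical, and q_values stands for estimates of the joint action value Q_tot(s_t, a_t) produced by a QMIX-style mixing network, evaluated on the actions actually taken (no greedy max, in line with the abstract). It uses the standard recursive form of the λ-return, G_t = r_t + γ[(1 − λ) Q_tot(s_{t+1}, a_{t+1}) + λ G_{t+1}].

    import numpy as np

    def lambda_return_targets(rewards, q_values, gamma=0.99, lam=0.8):
        # rewards:  r_1..r_T received along one stored episode.
        # q_values: Q_tot(s_t, a_t) for t = 1..T+1, where the last entry is
        #           the bootstrap value (0.0 if the episode ends in a
        #           terminal state).
        # Returns the lambda-return target G_t for every step t.
        T = len(rewards)
        targets = np.empty(T)
        g = q_values[T]  # bootstrap from the value after the last transition
        for t in reversed(range(T)):
            # Recursive lambda-return: mix the one-step bootstrap with the
            # longer return accumulated so far.
            g = rewards[t] + gamma * ((1 - lam) * q_values[t + 1] + lam * g)
            targets[t] = g
        return targets

    # Toy usage: a 3-step episode ending in a terminal state.
    rewards = [1.0, 0.0, 2.0]
    q_values = [0.5, 0.4, 0.9, 0.0]
    print(lambda_return_targets(rewards, q_values))

Training would then minimize the squared TD error between Q_tot(s_t, a_t) and these targets; with λ = 0 this reduces to a one-step SARSA-style target, and with λ = 1 to the full Monte Carlo return.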

