

Counterfactual value decomposition for cooperative multi-agent reinforcement learning.

Authors

Liu Kai, Zhang Tianxian, Xu Xiangliang, Zhao Yuyang

Affiliations

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, Sichuan, China.

Publication

Neural Netw. 2025 Oct;190:107692. doi: 10.1016/j.neunet.2025.107692. Epub 2025 Jun 16.

DOI: 10.1016/j.neunet.2025.107692
PMID: 40554295
Abstract

Value decomposition has become a central focus in Multi-Agent Reinforcement Learning (MARL) in recent years. The key challenge lies in the construction and updating of the factored value function (FVF). Traditional methods rely on FVFs with restricted representational capacity, rendering them inadequate for tasks with non-monotonic payoffs. Recent approaches address this limitation by designing FVF update mechanisms that enable applicability to non-monotonic scenarios. However, these methods typically depend on the true optimal joint action value to guide FVF updates. Since the true optimal joint action is computationally infeasible in practice, these methods approximate it using the greedy joint action and update the FVF with the corresponding greedy joint action value. We observe that although the greedy joint action may be close to the true optimal joint action, its associated greedy joint action value can be substantially biased relative to the true optimal joint action value. This makes the approximation unreliable and can lead to incorrect update directions for the FVF, hindering the learning process. To overcome this limitation, we propose Comix, a novel off-policy MARL method based on a Sandwich Value Decomposition Framework. Comix constrains and guides FVF updates using both upper and lower bounds. Specifically, it leverages orthogonal best responses to construct the upper bound, thus overcoming the drawbacks introduced by the optimal approximation. Furthermore, an attention mechanism is incorporated to ensure that the upper bound can be computed with linear time complexity and high accuracy. Theoretical analyses show that Comix satisfies the IGM (Individual-Global-Max) condition. Experiments on the asymmetric One-Step Matrix Game, discrete Predator-Prey, and the StarCraft Multi-Agent Challenge show that Comix achieves higher learning efficiency and outperforms several state-of-the-art methods.
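The failure mode the abstract highlights, where a greedy joint action's factored value is badly biased relative to the true optimal joint action value, can be reproduced with a minimal VDN-style sketch on a one-step matrix game. The payoff matrix and per-agent values below are hypothetical illustrations (in the spirit of the non-monotonic matrix games common in this literature), not the paper's Comix implementation:

```python
import numpy as np

# Hypothetical non-monotonic one-step matrix game for two agents.
# Joint action (0, 0) is optimal, but unilateral deviation is punished.
payoff = np.array([
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
])

# A monotonic factored value function (VDN-style): Q_tot = Q_1(a1) + Q_2(a2).
# Assume the punishing off-diagonal payoffs drove each agent's estimate
# for action 0 down during learning:
q1 = np.array([-4.0, 0.0, 0.0])
q2 = np.array([-4.0, 0.0, 0.0])

# Greedy joint action via per-agent argmax -- linear in the number of
# agents, rather than exponential enumeration of joint actions.
greedy = (int(np.argmax(q1)), int(np.argmax(q2)))

# Greedy joint action value under the FVF vs. the true optimal value.
greedy_value = float(q1[greedy[0]] + q2[greedy[1]])
true_optimal = float(payoff.max())

print(greedy, greedy_value, true_optimal)
```

Here the greedy joint action is (1, 1) with value 0, while the true optimum at (0, 0) pays 8: updating the FVF toward the greedy value would point learning in the wrong direction, which is the bias Comix's upper and lower bounds are designed to control.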


Similar Articles

1. Counterfactual value decomposition for cooperative multi-agent reinforcement learning. Neural Netw. 2025 Oct;190:107692. doi: 10.1016/j.neunet.2025.107692. Epub 2025 Jun 16.
2. Short-Term Memory Impairment
3. The Black Book of Psychotropic Dosing and Monitoring. Psychopharmacol Bull. 2024 Jul 8;54(3):8-59.
4. Systemic pharmacological treatments for chronic plaque psoriasis: a network meta-analysis. Cochrane Database Syst Rev. 2021 Apr 19;4(4):CD011535. doi: 10.1002/14651858.CD011535.pub4.
5. Representation-driven sampling and adaptive policy resetting for improving multi-agent reinforcement learning. Neural Netw. 2025 Jul 15;192:107875. doi: 10.1016/j.neunet.2025.107875.
6. Systemic pharmacological treatments for chronic plaque psoriasis: a network meta-analysis. Cochrane Database Syst Rev. 2020 Jan 9;1(1):CD011535. doi: 10.1002/14651858.CD011535.pub3.
7. Assessing the comparative effects of interventions in COPD: a tutorial on network meta-analysis for clinicians. Respir Res. 2024 Dec 21;25(1):438. doi: 10.1186/s12931-024-03056-x.
8. "In a State of Flow": A Qualitative Examination of Autistic Adults' Phenomenological Experiences of Task Immersion. Autism Adulthood. 2024 Sep 16;6(3):362-373. doi: 10.1089/aut.2023.0032. eCollection 2024 Sep.
9. Sexual Harassment and Prevention Training
10. Q-learning with temporal memory to navigate turbulence. Elife. 2025 Jul 21;13:RP102906. doi: 10.7554/eLife.102906.