Hu Xiaoliang, Guo Pengcheng, Li Yadong, Li Guangyu, Cui Zhen, Yang Jian
IEEE Trans Neural Netw Learn Syst. 2025 Jul;36(7):12521-12534. doi: 10.1109/TNNLS.2024.3455422.
In cooperative multiagent reinforcement learning (MARL), centralized training with decentralized execution (CTDE) has recently attracted growing attention due to practical deployment demands. However, the central dilemma therein is the inconsistency between jointly trained policies and individually executed actions. In this article, we propose a factorized Tchebycheff value-decomposition optimization (TVDO) method to overcome this inconsistency. In particular, a nonlinear Tchebycheff aggregation function is formulated to realize the global optimum by tightly constraining the upper bound of the individual action-value bias, inspired by the Tchebycheff method of multiobjective optimization (MOO). We theoretically prove that, without extra limitations, the factorized value decomposition with Tchebycheff aggregation satisfies both the sufficiency and necessity of individual-global-max (IGM), which guarantees the consistency between the global and individual optimal action-value functions. Empirically, in the climb and penalty games, we verify that TVDO precisely expresses the global-to-individual value decomposition with a guarantee of policy consistency. Meanwhile, we evaluate TVDO on the StarCraft multiagent challenge (SMAC) benchmark, and extensive experiments demonstrate that TVDO achieves significant performance gains over state-of-the-art MARL baselines.
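The nonlinear Tchebycheff aggregation the abstract refers to builds on the classic Tchebycheff scalarization from MOO, which scores a solution by its largest weighted deviation from an ideal point, thereby bounding every objective's gap at once. A minimal sketch of that underlying scalarization (the function name, weights, and toy values are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def tchebycheff_scalarize(objectives, weights, ideal):
    """Tchebycheff scalarization from multiobjective optimization:
    the scalar score is the largest weighted deviation from the ideal
    point, so minimizing it tightly constrains the upper bound of every
    objective's bias simultaneously."""
    objectives = np.asarray(objectives, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    return float(np.max(np.asarray(weights) * np.abs(objectives - ideal)))

# Toy example: compare two candidate solutions against the ideal point (0, 0).
w = [0.5, 0.5]
a = tchebycheff_scalarize([2.0, 1.0], w, [0.0, 0.0])  # max(1.0, 0.5) = 1.0
b = tchebycheff_scalarize([1.2, 1.1], w, [0.0, 0.0])  # max(0.6, 0.55) = 0.6
# b is preferred: its worst-case weighted deviation is smaller.
```

In the paper's setting, the analogous idea is to aggregate individual agents' action-value deviations so that optimizing the global value tightly bounds each agent's bias, which is what underlies the IGM consistency claim.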