Difference rewards policy gradients.

Authors

Castellini Jacopo, Devlin Sam, Oliehoek Frans A, Savani Rahul

Affiliations

Department of Computer Science, University of Liverpool, Liverpool, UK.

Microsoft Research Cambridge, Cambridge, UK.

Publication

Neural Comput Appl. 2025;37(19):13163-13186. doi: 10.1007/s00521-022-07960-5. Epub 2022 Nov 11.

Abstract

Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function as done by counterfactual multi-agent policy gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
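To make the idea concrete, below is a minimal, illustrative Python sketch (not the authors' implementation) of differencing a known reward function inside a tabular REINFORCE update. The particular counterfactual used here, replacing agent i's action with a uniform average over its alternative actions while holding the other agents fixed, and all names (`difference_rewards`, `reinforce_update`, `reward_fn`) are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def difference_rewards(reward_fn, state, joint_action, n_actions):
    """Per-agent difference rewards, assuming a known reward function.

    D_i = r(s, a) - mean_b r(s, (a_{-i}, b)), where b ranges over agent
    i's actions. `joint_action` must be a tuple of action indices.
    """
    r = reward_fn(state, joint_action)
    d = np.empty(len(joint_action))
    for i in range(len(joint_action)):
        counterfactuals = [
            reward_fn(state, joint_action[:i] + (b,) + joint_action[i + 1:])
            for b in range(n_actions)
        ]
        d[i] = r - np.mean(counterfactuals)
    return d

def reinforce_update(logits, episode, reward_fn, n_actions, lr=0.1, gamma=0.99):
    """One tabular REINFORCE step per agent using difference returns.

    `logits` has shape (n_agents, n_states, n_actions); `episode` is a
    list of (state_index, joint_action) pairs from one rollout.
    """
    n_agents = logits.shape[0]
    # Difference rewards at each timestep, shape (T, n_agents).
    d = np.array([difference_rewards(reward_fn, s, a, n_actions)
                  for s, a in episode])
    # Discounted difference returns, computed backwards through time.
    g = np.zeros_like(d)
    running = np.zeros(n_agents)
    for t in reversed(range(len(episode))):
        running = d[t] + gamma * running
        g[t] = running
    # Policy gradient step: grad of log softmax is one_hot(a_i) - probs.
    for t, (s, a) in enumerate(episode):
        for i in range(n_agents):
            probs = np.exp(logits[i, s]) / np.exp(logits[i, s]).sum()
            grad_log = -probs
            grad_log[a[i]] += 1.0
            logits[i, s] += lr * g[t, i] * grad_log
    return logits

if __name__ == "__main__":
    # Toy coordination game: reward 1 when both agents pick the same action.
    reward_fn = lambda s, a: 1.0 if a[0] == a[1] else 0.0
    logits = np.zeros((2, 1, 2))   # 2 agents, 1 state, 2 actions
    episode = [(0, (0, 0)), (0, (1, 0)), (0, (1, 1))]
    print(reinforce_update(logits, episode, reward_fn, n_actions=2))
```

In the unknown-reward setting described in the abstract, the learned-reward variant would amount to substituting a reward network, trained by regression on observed rewards, in place of `reward_fn` when computing the counterfactual terms.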

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f6d/12204931/332a9da7ba14/521_2022_7960_Fig1_HTML.jpg
