
Intrinsic fluctuations of reinforcement learning promote cooperation.

Affiliations

Tübingen AI Center, University of Tübingen, Tübingen, Germany.

Department of Applied Mathematics, University of Twente, Enschede, The Netherlands.

Publication

Sci Rep. 2023 Jan 24;13(1):1309. doi: 10.1038/s41598-023-27672-7.

Abstract

In this work, we ask for and answer what makes classical temporal-difference reinforcement learning with ε-greedy strategies cooperative. Cooperating in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which and how individual elements of the multi-agent learning setting lead to cooperation. We use the iterated Prisoner's dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions the following action choices on both agents' action choices of the last round. We find that next to a high caring for future rewards, a low exploration rate, and a small learning rate, it is primarily intrinsic stochastic fluctuations of the reinforcement learning process which double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process. It is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving this in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
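The setting described in the abstract can be illustrated with a minimal sketch: two independent ε-greedy Q-learners playing the iterated Prisoner's dilemma, each conditioning on the joint action pair of the previous round (one-period memory, i.e. four states per agent). This is an assumption-laden toy version, not the authors' exact implementation; the payoff values, hyperparameters, and the `run` function are illustrative choices.

```python
import random

# Prisoner's dilemma payoffs: (my action, other's action) -> my reward.
# 0 = cooperate, 1 = defect; standard T=5 > R=3 > P=1 > S=0 values (assumed).
PAYOFF = {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1}

def run(rounds=20000, alpha=0.05, gamma=0.95, eps=0.01, seed=0):
    """Two independent epsilon-greedy temporal-difference (Q-) learners
    in the iterated Prisoner's dilemma with one-period memory.

    Each agent's state is (own last action, opponent's last action),
    so there are 4 states and 2 actions per agent. Returns the overall
    fraction of cooperative moves across the run.
    """
    rng = random.Random(seed)
    # Q[i][state] -> [value of cooperate, value of defect] for agent i.
    Q = [{(a, b): [0.0, 0.0] for a in (0, 1) for b in (0, 1)} for _ in range(2)]
    state = [(0, 0), (0, 0)]  # start as if both cooperated last round
    coop_moves = 0
    for _ in range(rounds):
        acts = []
        for i in range(2):
            if rng.random() < eps:          # explore
                acts.append(rng.randrange(2))
            else:                           # exploit (greedy)
                q = Q[i][state[i]]
                acts.append(0 if q[0] >= q[1] else 1)
        rewards = [PAYOFF[(acts[0], acts[1])], PAYOFF[(acts[1], acts[0])]]
        next_state = [(acts[0], acts[1]), (acts[1], acts[0])]
        for i in range(2):
            q, nq = Q[i][state[i]], Q[i][next_state[i]]
            # One-step temporal-difference update toward r + gamma * max Q(s').
            q[acts[i]] += alpha * (rewards[i] + gamma * max(nq) - q[acts[i]])
        state = next_state
        coop_moves += acts.count(0)
    return coop_moves / (2 * rounds)
```

The abstract's three highlighted conditions map directly onto the parameters: `gamma` (caring for future rewards), `eps` (exploration rate), and `alpha` (learning rate); the intrinsic stochastic fluctuations enter through the ε-exploration and the sequential, sample-based updates.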


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d133/9873645/5136ede5c567/41598_2023_27672_Fig1_HTML.jpg
