Tübingen AI Center, University of Tübingen, Tübingen, Germany.
Department of Applied Mathematics, University of Twente, Enschede, The Netherlands.
Sci Rep. 2023 Jan 24;13(1):1309. doi: 10.1038/s41598-023-27672-7.
In this work, we ask and answer what makes classical temporal-difference reinforcement learning with ε-greedy strategies cooperative. Cooperation in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. We use the iterated Prisoner's Dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions its next action choice on both agents' action choices in the previous round. We find that, next to a strong valuation of future rewards, a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process; it is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving it in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
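To make the learning setting described above concrete, the following is a minimal sketch of two tabular temporal-difference (Q-learning) agents with ε-greedy exploration playing the iterated Prisoner's Dilemma with one-period memory. The payoff values, hyperparameters, and function names are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

# Illustrative sketch (not the paper's exact setup): two tabular Q-learners
# play the iterated Prisoner's Dilemma. The state is the pair of actions both
# agents chose in the previous round (4 joint states); actions are
# 0 = cooperate, 1 = defect. Payoffs R=3, S=0, T=5, P=1 are textbook values
# assumed here for illustration only.

PAYOFF = np.array([[3, 0],   # row: my action, column: opponent's action
                   [5, 1]])

def epsilon_greedy(q_row, epsilon, rng):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    if rng.random() < epsilon:
        return int(rng.integers(2))
    return int(np.argmax(q_row))

def run_episode(alpha=0.05, gamma=0.95, epsilon=0.05, rounds=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # One Q-table per agent: 4 joint-action states x 2 actions.
    q = [np.zeros((4, 2)), np.zeros((4, 2))]
    state = int(rng.integers(4))      # arbitrary initial joint action
    cooperative_choices = 0
    for _ in range(rounds):
        a0 = epsilon_greedy(q[0][state], epsilon, rng)
        a1 = epsilon_greedy(q[1][state], epsilon, rng)
        r0, r1 = PAYOFF[a0, a1], PAYOFF[a1, a0]
        next_state = 2 * a0 + a1      # both agents' last actions form the new state
        # Standard temporal-difference (Q-learning) update for each agent.
        q[0][state, a0] += alpha * (r0 + gamma * q[0][next_state].max() - q[0][state, a0])
        q[1][state, a1] += alpha * (r1 + gamma * q[1][next_state].max() - q[1][state, a1])
        state = next_state
        cooperative_choices += (a0 == 0) + (a1 == 0)
    return cooperative_choices / (2 * rounds)

if __name__ == "__main__":
    print(f"cooperation rate: {run_episode():.2f}")
```

In this sketch the discount factor `gamma` plays the role of the "valuation of future rewards", `epsilon` the exploration rate, and `alpha` the learning rate discussed in the abstract; the stochastic fluctuations arise from the ε-greedy action sampling and the order in which state-action pairs are visited.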