Tübingen AI Center, University of Tübingen, Tübingen, Germany.
Department of Applied Mathematics, University of Twente, Enschede, The Netherlands.
Sci Rep. 2023 Jan 24;13(1):1309. doi: 10.1038/s41598-023-27672-7.
In this work, we ask and answer what makes classical temporal-difference reinforcement learning with ε-greedy strategies cooperative. Cooperation in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. We use the iterated Prisoner's Dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions its next action choice on both agents' action choices in the previous round. We find that, next to a strong valuation of future rewards, a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process; it is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving it in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
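To make the learning setting described above concrete, the following is a minimal sketch of two tabular temporal-difference (Q-learning) agents with ε-greedy exploration playing the iterated Prisoner's Dilemma with one-period memory. The payoff values, hyperparameters, and function names are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

# Illustrative sketch (not the paper's exact setup): two tabular Q-learners
# play the iterated Prisoner's Dilemma. The state is the pair of actions both
# agents chose in the previous round (4 joint states); actions are
# 0 = cooperate, 1 = defect. Payoffs R=3, S=0, T=5, P=1 are textbook values
# assumed here for illustration only.

PAYOFF = np.array([[3, 0],   # row: my action, column: opponent's action
                   [5, 1]])

def epsilon_greedy(q_row, epsilon, rng):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    if rng.random() < epsilon:
        return int(rng.integers(2))
    return int(np.argmax(q_row))

def run_episode(alpha=0.05, gamma=0.95, epsilon=0.05, rounds=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # One Q-table per agent: 4 joint-action states x 2 actions.
    q = [np.zeros((4, 2)), np.zeros((4, 2))]
    state = int(rng.integers(4))      # arbitrary initial joint action
    cooperative_choices = 0
    for _ in range(rounds):
        a0 = epsilon_greedy(q[0][state], epsilon, rng)
        a1 = epsilon_greedy(q[1][state], epsilon, rng)
        r0, r1 = PAYOFF[a0, a1], PAYOFF[a1, a0]
        next_state = 2 * a0 + a1      # both agents' last actions form the new state
        # Standard temporal-difference (Q-learning) update for each agent.
        q[0][state, a0] += alpha * (r0 + gamma * q[0][next_state].max() - q[0][state, a0])
        q[1][state, a1] += alpha * (r1 + gamma * q[1][next_state].max() - q[1][state, a1])
        state = next_state
        cooperative_choices += (a0 == 0) + (a1 == 0)
    return cooperative_choices / (2 * rounds)

if __name__ == "__main__":
    print(f"cooperation rate: {run_episode():.2f}")
```

In this sketch the discount factor `gamma` plays the role of the "valuation of future rewards", `epsilon` the exploration rate, and `alpha` the learning rate discussed in the abstract; the stochastic fluctuations arise from the ε-greedy action sampling and the order in which state-action pairs are visited.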