

Multiagent reinforcement learning in the Iterated Prisoner's Dilemma.

Authors

Sandholm T W, Crites R H

Affiliation

Computer Science Department, University of Massachusetts at Amherst 01003, USA.

Publication

Biosystems. 1996;37(1-2):147-66. doi: 10.1016/0303-2647(95)01551-5.

DOI: 10.1016/0303-2647(95)01551-5
PMID: 8924633
Abstract

Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened (reinforced) if it produces favorable results, and weakened if it produces unfavorable results. Q-learning is a recent RL algorithm that does not need a model of its environment and can be used on-line. Therefore, it is well suited for use in repeated games against an unknown opponent. Most RL research has been confined to single-agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This paper is an empirical study of reinforcement learning in the Iterated Prisoner's Dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. This paper investigates the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-For-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-For-Tat. Playing against another learner was more difficult because the adaptation of the other learner created a non-stationary environment, and because the other learner was not endowed with any a priori knowledge about the IPD game such as a policy designed to encourage cooperation. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games.
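To make the setup concrete, here is a minimal sketch, not the authors' implementation, of one configuration the abstract describes: a tabular (lookup-table) Q-learner whose state is a one-move history window, trained against the fixed strategy Tit-For-Tat. The payoff values (T=5, R=3, P=1, S=0) are the conventional IPD choices; the learning rate, discount factor, and decaying exploration schedule are arbitrary illustrative settings.

```python
import random

C, D = 0, 1  # cooperate, defect
# Learner's payoff for (my action, opponent action): R=3, S=0, T=5, P=1
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

def train_q_learner(rounds=50_000, alpha=0.1, gamma=0.95, seed=0):
    rng = random.Random(seed)
    Q = {C: [0.0, 0.0], D: [0.0, 0.0]}  # Q[state][action]
    my_prev = C  # Tit-For-Tat echoes our last move; assume prior cooperation
    for t in range(rounds):
        state = my_prev  # the move Tit-For-Tat is about to play
        # Decaying exploration schedule, floored at 5%
        epsilon = max(0.05, 1.0 - t / 10_000)
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = C if Q[state][C] >= Q[state][D] else D
        reward = PAYOFF[(action, state)]
        next_state = action  # next round Tit-For-Tat echoes this action
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action]
        )
        my_prev = action
    return Q

Q = train_q_learner()
# With gamma near 1, mutual cooperation (3 per round forever) outvalues a
# one-shot defection (5, then Tit-For-Tat's punishment), so the greedy
# policy learned here cooperates in both states.
print(Q[C], Q[D])
```

This mirrors the paper's finding that Q-learners reach the optimal policy against Tit-For-Tat; the harder case the paper studies, two Q-learners adapting simultaneously, breaks the stationarity this update rule assumes.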


Similar Articles

1. Multiagent reinforcement learning in the Iterated Prisoner's Dilemma.
Biosystems. 1996;37(1-2):147-66. doi: 10.1016/0303-2647(95)01551-5.
2. Spiking neural networks with different reinforcement learning (RL) schemes in a multiagent setting.
Chin J Physiol. 2010 Dec 31;53(6):447-53.
3. Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's Dilemma.
IEEE Trans Neural Netw. 2011 Apr;22(4):639-53. doi: 10.1109/TNN.2011.2111384. Epub 2011 Mar 17.
4. A theoretical analysis of temporal difference learning in the iterated prisoner's dilemma game.
Bull Math Biol. 2009 Nov;71(8):1818-50. doi: 10.1007/s11538-009-9424-8. Epub 2009 May 29.
5. Self-control with spiking and non-spiking neural networks playing games.
J Physiol Paris. 2010 May-Sep;104(3-4):108-17. doi: 10.1016/j.jphysparis.2009.11.013. Epub 2009 Nov 26.
6. Contingencies of reinforcement in a five-person prisoner's dilemma.
J Exp Anal Behav. 2004 Sep;82(2):161-76. doi: 10.1901/jeab.2004.82-161.
7. Cooperative responses in rats playing a 2 × 2 game: Effects of opponent strategy, payoff, and oxytocin.
Psychoneuroendocrinology. 2020 Nov;121:104803. doi: 10.1016/j.psyneuen.2020.104803. Epub 2020 Aug 2.
8. Collapse of cooperation in evolving games.
Proc Natl Acad Sci U S A. 2014 Dec 9;111(49):17558-63. doi: 10.1073/pnas.1408618111. Epub 2014 Nov 24.
9. Autocratic strategies for iterated games with arbitrary action spaces.
Proc Natl Acad Sci U S A. 2016 Mar 29;113(13):3573-8. doi: 10.1073/pnas.1520163113. Epub 2016 Mar 14.
10. Numerical analysis of a reinforcement learning model with the dynamic aspiration level in the iterated Prisoner's dilemma.
J Theor Biol. 2011 Jun 7;278(1):55-62. doi: 10.1016/j.jtbi.2011.03.005. Epub 2011 Mar 29.

Cited By

1. Unsupervised learning of perceptual feature combinations.
PLoS Comput Biol. 2024 Mar 5;20(3):e1011926. doi: 10.1371/journal.pcbi.1011926. eCollection 2024 Mar.
2. Adversarial Dynamics in Centralized Versus Decentralized Intelligent Systems.
Top Cogn Sci. 2025 Apr;17(2):374-391. doi: 10.1111/tops.12705. Epub 2023 Oct 30.
3. Evolutionary instability of selfish learning in repeated games.
PNAS Nexus. 2022 Jul 27;1(4):pgac141. doi: 10.1093/pnasnexus/pgac141. eCollection 2022 Sep.
4. Intrinsic fluctuations of reinforcement learning promote cooperation.
Sci Rep. 2023 Jan 24;13(1):1309. doi: 10.1038/s41598-023-27672-7.
5. Nash equilibria in human sensorimotor interactions explained by Q-learning with intrinsic costs.
Sci Rep. 2021 Oct 21;11(1):20779. doi: 10.1038/s41598-021-99428-0.
6. Confronting barriers to human-robot cooperation: balancing efficiency and risk in machine behavior.
iScience. 2020 Dec 17;24(1):101963. doi: 10.1016/j.isci.2020.101963. eCollection 2021 Jan 22.
7. Cooperating with machines.
Nat Commun. 2018 Jan 16;9(1):233. doi: 10.1038/s41467-017-02597-8.
8. Sustainability is possible despite greed - Exploring the nexus between profitability and sustainability in common pool resource systems.
Sci Rep. 2017 May 23;7(1):2307. doi: 10.1038/s41598-017-02151-y.
9. A game theoretic framework for incentive-based models of intrinsic motivation in artificial systems.
Front Psychol. 2013 Oct 30;4:791. doi: 10.3389/fpsyg.2013.00791. eCollection 2013.
10. Special agents can promote cooperation in the population.
PLoS One. 2011;6(12):e29182. doi: 10.1371/journal.pone.0029182. Epub 2011 Dec 21.