Learning to play against any mixture of opponents.

Authors

Max Olan Smith, Thomas Anthony, Michael P. Wellman

Affiliations

Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, United States.

DeepMind, London, United Kingdom.

Publication

Front Artif Intell. 2023 Jul 20;6:804682. doi: 10.3389/frai.2023.804682. eCollection 2023.

DOI: 10.3389/frai.2023.804682
PMID: 37547229
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10400709/
Abstract

Intuitively, experience playing against one mixture of opponents in a given domain should be relevant for a different mixture in the same domain. If the mixture changes, ideally we would not have to train from scratch, but rather could transfer what we have learned to construct a policy to play against the new mixture. We propose a transfer learning method, Q-Mixing, that starts by learning Q-values against each pure-strategy opponent. Then a Q-value for a distribution of opponent strategies is approximated by appropriately averaging the separately learned Q-values. From these components, we construct policies against all opponent mixtures without any further training. We empirically validate Q-Mixing in two environments: a simple grid-world soccer environment, and a social dilemma game. Our experiments find that Q-Mixing can successfully transfer knowledge across any mixture of opponents. Next, we consider the use of observations during play to update the believed distribution of opponents. We introduce an opponent policy classifier, trained by reusing Q-learning data, and use the classifier results to refine the mixing of Q-values. Q-Mixing augmented with the opponent policy classifier performs better, with higher variance, than training directly against a mixed-strategy opponent.
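
The abstract compactly describes the whole algorithm: learn a value function Q_i separately against each pure-strategy opponent i, then answer any mixture psi over opponents with the averaged value Q_psi(s, a) = sum_i psi(i) * Q_i(s, a), acting greedily, with an optional classifier step that sharpens the belief over opponents during play. The sketch below illustrates this in Python under assumptions not taken from the paper: a tabular setting indexed by integer states (the paper's experiments use learned function approximators), and illustrative names such as q_mixing_policy and refine_belief. The belief update shown is a plain Bayes-style reweighting; the paper's classifier-based refinement may differ in detail.

```python
import numpy as np

def q_mixing_policy(q_values, opponent_dist):
    """Greedy policy against a mixture of opponents via Q-Mixing.

    q_values:      shape (n_opponents, n_states, n_actions); Q_i(s, a)
                   learned separately against each pure-strategy opponent i.
    opponent_dist: shape (n_opponents,); the believed distribution psi
                   over the opponent's pure strategies.
    """
    # Q_psi(s, a) = sum_i psi(i) * Q_i(s, a): average the per-opponent
    # Q-values, weighted by the probability of facing each opponent.
    q_mix = np.tensordot(opponent_dist, q_values, axes=1)

    def act(state):
        # Act greedily with respect to the mixed Q-values; no further
        # training is needed when the mixture changes.
        return int(np.argmax(q_mix[state]))

    return act

def refine_belief(prior, likelihoods):
    """Bayes-style belief update from in-play observations.

    likelihoods: P(observation | opponent i), e.g. as estimated by an
                 opponent-policy classifier trained on reused Q-learning data.
    """
    posterior = prior * likelihoods
    return posterior / posterior.sum()
```

For example, with per-opponent Q-tables `q` of shape (3, n_states, n_actions), `q_mixing_policy(q, np.array([0.5, 0.3, 0.2]))` yields a policy for the 50/30/20 mixture without retraining; after each observation, `refine_belief` can replace the prior, and the policy can be rebuilt from the same stored Q-values.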


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/8320e92cf18d/frai-06-804682-g0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/ebc2bf06c154/frai-06-804682-g0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/7af32d09b2aa/frai-06-804682-g0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/54457f3fd89f/frai-06-804682-g0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/415e3dc742f5/frai-06-804682-g0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/23e10bea5679/frai-06-804682-g0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fb0/10400709/bc3bb8e099d9/frai-06-804682-g0007.jpg

Similar articles

1. Learning to play against any mixture of opponents.
   Front Artif Intell. 2023 Jul 20;6:804682. doi: 10.3389/frai.2023.804682. eCollection 2023.
2. Multiagent reinforcement learning in the Iterated Prisoner's Dilemma.
   Biosystems. 1996;37(1-2):147-66. doi: 10.1016/0303-2647(95)01551-5.
3. Opponent Identity Influences Value Learning in Simple Games.
   J Neurosci. 2015 Aug 5;35(31):11133-43. doi: 10.1523/JNEUROSCI.3530-14.2015.
4. Learning Macromanagement in Starcraft by Deep Reinforcement Learning.
   Sensors (Basel). 2021 May 11;21(10):3332. doi: 10.3390/s21103332.
5. Learning agile soccer skills for a bipedal robot with deep reinforcement learning.
   Sci Robot. 2024 Apr 10;9(89):eadi8022. doi: 10.1126/scirobotics.adi8022.
6. You Were Always on My Mind: Introducing Chef's Hat and COPPER for Personalized Reinforcement Learning.
   Front Robot AI. 2021 Jul 16;8:669990. doi: 10.3389/frobt.2021.669990. eCollection 2021.
7. Multi-agent reinforcement learning with approximate model learning for competitive games.
   PLoS One. 2019 Sep 11;14(9):e0222215. doi: 10.1371/journal.pone.0222215. eCollection 2019.
8. All by Myself: Learning individualized competitive behavior with a contrastive reinforcement learning optimization.
   Neural Netw. 2022 Jun;150:364-376. doi: 10.1016/j.neunet.2022.03.013. Epub 2022 Mar 18.
9. Breaking the bonds of reinforcement: Effects of trial outcome, rule consistency and rule complexity against exploitable and unexploitable opponents.
   PLoS One. 2022 Feb 2;17(2):e0262249. doi: 10.1371/journal.pone.0262249. eCollection 2022.
10. Adaptive pessimism via target Q-value for offline reinforcement learning.
    Neural Netw. 2024 Dec;180:106588. doi: 10.1016/j.neunet.2024.106588. Epub 2024 Aug 5.
