Finding structure in multi-armed bandits.

Authors

Schulz Eric, Franklin Nicholas T, Gershman Samuel J

Affiliations

Harvard University, United States.

Publication

Cogn Psychol. 2020 Jun;119:101261. doi: 10.1016/j.cogpsych.2019.101261. Epub 2020 Feb 12.

Abstract

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
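The task setup described in the abstract can be made concrete with a small simulation. Below is a minimal sketch, not the authors' implementation, of one round of a linearly structured bandit: rewards are a linear function of each arm's spatial position, the agent learns that function with Bayesian linear regression (a stand-in for the paper's function-learning mechanism), and it selects arms with an upper-confidence-bound (UCB) rule as a simple form of uncertainty-guided exploration. All numeric settings (number of arms, noise level, prior variance, UCB weight) are illustrative assumptions, and the clustering-across-rounds component is omitted.

```python
# Minimal sketch of a structured bandit round: linear latent reward
# function over arm positions, Bayesian linear regression for function
# learning, and UCB for uncertainty-guided exploration.
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_trials = 8, 10
positions = np.linspace(0, 1, n_arms)              # spatial position of each arm
slope, intercept, noise_sd = 40.0, 10.0, 5.0       # latent linear reward function (assumed values)
X = np.column_stack([np.ones(n_arms), positions])  # design matrix: [1, position]

# Gaussian prior over weights (intercept, slope); obs_var is reward noise variance
prior_var, obs_var = 100.0, noise_sd ** 2
A = np.eye(2) / prior_var   # posterior precision (starts at prior)
b = np.zeros(2)             # accumulated X^T y / obs_var
ucb_weight = 2.0            # exploration bonus (illustrative)

for t in range(n_trials):
    # Posterior over the weights given observations so far
    cov = np.linalg.inv(A)
    mean_w = cov @ b
    # Predictive mean and variance for every arm
    pred_mean = X @ mean_w
    pred_var = np.einsum('ij,jk,ik->i', X, cov, X) + obs_var
    # UCB: prefer arms with high expected reward or high uncertainty
    arm = int(np.argmax(pred_mean + ucb_weight * np.sqrt(pred_var)))
    reward = intercept + slope * positions[arm] + rng.normal(0, noise_sd)
    # Bayesian update with the observed (arm, reward) pair
    x = X[arm]
    A += np.outer(x, x) / obs_var
    b += x * reward / obs_var
    print(f"trial {t}: arm {arm}, reward {reward:.1f}")
```

Because rewards share a latent linear structure, a few observations constrain the predictions for every arm at once, so the agent converges on the best arm far faster than an independent-arms learner could; this is the generalization-guided exploration the experiments probe.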
