Finding structure in multi-armed bandits.

Authors

Schulz Eric, Franklin Nicholas T, Gershman Samuel J

Affiliations

Harvard University, United States.

Publication

Cogn Psychol. 2020 Jun;119:101261. doi: 10.1016/j.cogpsych.2019.101261. Epub 2020 Feb 12.

Abstract

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
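The task setup described in the abstract can be made concrete with a small simulation. Below is a minimal sketch, not the authors' implementation, of one round of a linearly structured bandit: rewards are a linear function of each arm's spatial position, the agent learns that function with Bayesian linear regression (a stand-in for the paper's function-learning mechanism), and it selects arms with an upper-confidence-bound (UCB) rule as a simple form of uncertainty-guided exploration. All numeric settings (number of arms, noise level, prior variance, UCB weight) are illustrative assumptions, and the clustering-across-rounds component is omitted.

```python
# Minimal sketch of a structured bandit round: linear latent reward
# function over arm positions, Bayesian linear regression for function
# learning, and UCB for uncertainty-guided exploration.
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_trials = 8, 10
positions = np.linspace(0, 1, n_arms)              # spatial position of each arm
slope, intercept, noise_sd = 40.0, 10.0, 5.0       # latent linear reward function (assumed values)
X = np.column_stack([np.ones(n_arms), positions])  # design matrix: [1, position]

# Gaussian prior over weights (intercept, slope); obs_var is reward noise variance
prior_var, obs_var = 100.0, noise_sd ** 2
A = np.eye(2) / prior_var   # posterior precision (starts at prior)
b = np.zeros(2)             # accumulated X^T y / obs_var
ucb_weight = 2.0            # exploration bonus (illustrative)

for t in range(n_trials):
    # Posterior over the weights given observations so far
    cov = np.linalg.inv(A)
    mean_w = cov @ b
    # Predictive mean and variance for every arm
    pred_mean = X @ mean_w
    pred_var = np.einsum('ij,jk,ik->i', X, cov, X) + obs_var
    # UCB: prefer arms with high expected reward or high uncertainty
    arm = int(np.argmax(pred_mean + ucb_weight * np.sqrt(pred_var)))
    reward = intercept + slope * positions[arm] + rng.normal(0, noise_sd)
    # Bayesian update with the observed (arm, reward) pair
    x = X[arm]
    A += np.outer(x, x) / obs_var
    b += x * reward / obs_var
    print(f"trial {t}: arm {arm}, reward {reward:.1f}")
```

Because rewards share a latent linear structure, a few observations constrain the predictions for every arm at once, so the agent converges on the best arm far faster than an independent-arms learner could; this is the generalization-guided exploration the experiments probe.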
