Schulz Eric, Franklin Nicholas T, Gershman Samuel J
Harvard University, United States.
Cogn Psychol. 2020 Jun;119:101261. doi: 10.1016/j.cogpsych.2019.101261. Epub 2020 Feb 12.
How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
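To make the task structure concrete, the following is a minimal sketch of one round of a structured bandit in which mean reward is a linear function of an option's position, paired with a learner that fits that function and explores by an upper-confidence-bound rule. It is an illustration under stated assumptions, not the authors' model: the learner is plain Bayesian linear regression, the clustering of reward distributions across rounds is omitted, and every name and numeric setting (`make_linear_bandit`, `posterior`, `run_round`, eight arms, ten trials, the slope and intercept ranges, the noise level, the prior width, and `beta`) is hypothetical.

```python
import numpy as np


def make_linear_bandit(n_arms=8, noise_sd=1.0, rng=None):
    """Build a bandit whose mean reward is linear in arm position.

    The slope and intercept ranges below are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    positions = np.arange(n_arms, dtype=float)
    slope = rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 3.0)
    intercept = rng.uniform(0.0, 10.0)
    means = intercept + slope * positions

    def pull(arm):
        # Noisy reward for the chosen arm.
        return means[arm] + rng.normal(0.0, noise_sd)

    return positions, means, pull


def posterior(X, y, noise_sd=1.0, prior_sd=10.0):
    """Gaussian posterior over (intercept, slope) for Bayesian linear regression.

    Assumes a zero-mean isotropic Gaussian prior and known noise variance.
    """
    d = X.shape[1]
    precision = np.eye(d) / prior_sd**2 + X.T @ X / noise_sd**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_sd**2
    return mean, cov


def run_round(n_arms=8, n_trials=10, beta=2.0, seed=0):
    """Play one round, choosing the arm with the highest mean + beta * sd."""
    rng = np.random.default_rng(seed)
    positions, means, pull = make_linear_bandit(n_arms, rng=rng)
    # Each arm is described by the features (1, position).
    features = np.column_stack([np.ones(n_arms), positions])
    X = np.empty((0, 2))
    y = np.empty(0)
    choices = []
    for _ in range(n_trials):
        w_mean, w_cov = posterior(X, y)
        pred_mean = features @ w_mean
        # Predictive sd of the latent function at every arm: sqrt(f^T C f).
        pred_sd = np.sqrt(np.einsum("ij,jk,ik->i", features, w_cov, features))
        arm = int(np.argmax(pred_mean + beta * pred_sd))
        reward = pull(arm)
        X = np.vstack([X, features[arm]])
        y = np.append(y, reward)
        choices.append(arm)
    return choices, means


if __name__ == "__main__":
    choices, means = run_round(seed=0)
    print("chosen arms:", choices)
    print("true mean rewards:", np.round(means, 2))
```

Because the learner shares two parameters across all arms, a reward observed at one position changes the predictions for every other position, which is the sense in which generalization can guide exploration here. An independent-arm learner would have to sample each option separately.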