Sources of suboptimality in a minimalistic explore-exploit task.

Affiliations

Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA.

Center for Neural Science, New York University, New York, NY, USA.

Publication information

Nat Hum Behav. 2019 Apr;3(4):361-368. doi: 10.1038/s41562-018-0526-x. Epub 2019 Feb 11.

DOI:10.1038/s41562-018-0526-x

PMID:30971784

Abstract

People often choose between sticking with an available good option (exploitation) and trying out a new option that is uncertain but potentially more rewarding (exploration). Laboratory studies on explore-exploit decisions often contain real-world complexities such as non-stationary environments, stochasticity under exploitation and unknown reward distributions. However, such factors might limit the researcher's ability to understand the essence of people's explore-exploit decisions. For this reason, we introduce a minimalistic task in which the optimal policy is to start off exploring and to switch to exploitation at most once in each sequence of decisions. The behaviour of 49 laboratory and 143 online participants deviated both qualitatively and quantitatively from the optimal policy, even when allowing for bias and decision noise. Instead, people seem to follow a suboptimal rule in which they switch from exploration to exploitation when the highest reward so far exceeds a certain threshold. Moreover, we show that this threshold decreases approximately linearly with the proportion of the sequence that remains, suggesting a temporal ratio law. Finally, we find evidence for 'sequence-level' variability that is shared across all decisions in the same sequence. Our results emphasize the importance of examining sequence-level strategies and their variability when studying sequential decision-making.
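The threshold rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the function names, the `intercept` and `slope` parameters, and the example values are all hypothetical, and the abstract gives neither the fitted values nor the reward scale. The only assumptions taken from the abstract are that the threshold is approximately linear in the proportion of the sequence that remains, and that a participant switches to exploitation once the highest reward so far exceeds it.

```python
def threshold(remaining_fraction: float, intercept: float, slope: float) -> float:
    """Linear threshold in the proportion of the sequence that remains.

    `intercept` and `slope` are hypothetical free parameters; the abstract
    does not report their values, so they must be supplied by the caller.
    """
    return intercept + slope * remaining_fraction


def choose(best_so_far: float, trials_left: int, sequence_length: int,
           intercept: float, slope: float) -> str:
    """Return "exploit" if the best reward so far exceeds the threshold,
    otherwise "explore"."""
    remaining_fraction = trials_left / sequence_length
    if best_so_far > threshold(remaining_fraction, intercept, slope):
        return "exploit"
    return "explore"


# Illustrative values only: halfway through a 10-trial sequence,
# with intercept 3.0 and slope 2.0 the threshold is 4.0.
print(choose(best_so_far=5.0, trials_left=5, sequence_length=10,
             intercept=3.0, slope=2.0))
print(choose(best_so_far=3.5, trials_left=5, sequence_length=10,
             intercept=3.0, slope=2.0))
```

Because the rule depends only on the best reward so far and the fraction of the sequence remaining, the same two parameters apply to sequences of any length, which is one way to read the "temporal ratio law" mentioned in the abstract.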
