Zhou Corey Yishan, Guo Dalin, Yu Angela J
Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093, USA.
CogSci. 2020 Jul-Aug;42:1682-1688.
Humans frequently overestimate the likelihood of desirable events while underestimating the likelihood of undesirable ones: a phenomenon known as unrealistic optimism. Previously, it was suggested that unrealistic optimism arises from asymmetric belief updating, with relatively reduced coding of undesirable information. Prior studies have shown that a reinforcement learning (RL) model with asymmetric learning rates (greater for a positive prediction error than for a negative prediction error) could account for unrealistic optimism in a bandit task, in particular the tendency of human subjects to persistently choose a single option when there are multiple equally good options. Here, we propose an alternative explanation of such persistent behavior, by modeling human behavior with a Bayesian hidden Markov model, the Dynamic Belief Model (DBM). We find that DBM captures human choice behavior better than the previously proposed asymmetric RL model. Whereas asymmetric RL attains a measure of optimism by giving better-than-expected outcomes higher learning weights than worse-than-expected outcomes, DBM does so by progressively devaluing the unchosen options, thus placing a greater emphasis on choice history independent of reward outcome (e.g. an oft-chosen option might continue to be preferred even if it has not been particularly rewarding), a dependence that has broadly been shown to underlie sequential effects in a variety of behavioral settings. Moreover, previous work showed that the devaluation of unchosen options in DBM helps to compensate for a default assumption of environmental non-stationarity, thus allowing the decision-maker both to be more adaptive in changing environments and to obtain near-optimal performance in stationary environments. Thus, the current work suggests both a novel rationale and a novel mechanism for persistent behavior in bandit tasks.
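For concreteness, the two update rules contrasted above can be sketched as follows; this is a minimal illustration, and the notation ($\alpha^{+}$, $\alpha^{-}$, $\gamma$, $p_0$) is ours rather than taken verbatim from the paper. The asymmetric RL model updates the value $Q_t(a)$ of the chosen option $a$ with a learning rate that depends on the sign of the prediction error $\delta_t$:
$$\delta_t = r_t - Q_t(a), \qquad Q_{t+1}(a) = Q_t(a) + \begin{cases} \alpha^{+}\,\delta_t, & \delta_t > 0 \\ \alpha^{-}\,\delta_t, & \delta_t \le 0, \end{cases} \qquad \alpha^{+} > \alpha^{-},$$
so better-than-expected outcomes carry greater learning weight. In DBM, the belief about each option's reward rate $\theta$ is propagated under a default assumption of non-stationarity: on each trial the rate persists with probability $\gamma$ and is otherwise redrawn from a generic prior $p_0$,
$$p_t(\theta) = \gamma\, p_{t-1}(\theta \mid \text{data}_{1:t-1}) + (1-\gamma)\, p_0(\theta),$$
with a Bayesian (Bernoulli-likelihood) update applied only to the option actually chosen. Unchosen options receive no likelihood term, so their beliefs relax toward $p_0$ trial by trial; this relaxation is the gradual devaluation of unchosen options referred to above.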