Eric Schulz, Emmanouil Konstantinidis, Maarten Speekenbrink
Department of Experimental Psychology.
School of Psychology, University of New South Wales.
J Exp Psychol Learn Mem Cogn. 2018 Jun;44(6):927-943. doi: 10.1037/xlm0000463. Epub 2017 Nov 13.
The authors introduce the contextual multi-armed bandit task as a framework to investigate learning and decision making in uncertain environments. In this novel paradigm, participants repeatedly choose between multiple options in order to maximize their rewards. The options are described by a number of contextual features which are predictive of the rewards through initially unknown functions. From their experience with choosing options and observing the consequences of their decisions, participants can learn about the functional relation between contexts and rewards and improve their decision strategy over time. In three experiments, the authors explore participants' behavior in such learning environments. They predict participants' behavior by context-blind (mean-tracking, Kalman filter) and contextual (Gaussian process and linear regression) learning approaches combined with different choice strategies. Participants are mostly able to learn about the context-reward functions, and their behavior is best described by a Gaussian process learning strategy which generalizes previous experience to similar instances. In a relatively simple task with binary features, they seem to combine this learning with a probability-of-improvement decision strategy which focuses on alternatives that are expected to lead to an improvement upon a current favorite option. In a task with continuous features that are linearly related to the rewards, participants seem to more explicitly balance exploration and exploitation. Finally, in a difficult learning environment where the relation between features and rewards is nonlinear, some participants are again well described by a Gaussian process learning strategy, whereas others revert to context-blind strategies.
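Two of the model components compared in the abstract lend themselves to a compact sketch: the context-blind Kalman-filter learner that tracks each option's mean reward, and the probability-of-improvement choice rule that favors options likely to beat the current favorite. The code below is an illustrative sketch under Gaussian-posterior assumptions, not the authors' implementation; the function names and the `xi` improvement margin are assumptions introduced here.

```python
import math

def kalman_update(mean, var, reward, obs_noise=1.0):
    """Context-blind mean tracking: a one-dimensional Kalman filter
    update of an option's estimated reward after one observation."""
    gain = var / (var + obs_noise)            # Kalman gain
    new_mean = mean + gain * (reward - mean)  # shift toward the observation
    new_var = (1.0 - gain) * var              # uncertainty shrinks with data
    return new_mean, new_var

def probability_of_improvement(means, sds, xi=0.0):
    """Choice rule: for each option, the probability that its reward
    exceeds the current best posterior mean by at least xi, assuming
    independent Gaussian posteriors given by (means, sds)."""
    best = max(means)
    probs = []
    for m, s in zip(means, sds):
        if s == 0.0:
            probs.append(1.0 if m > best + xi else 0.0)
        else:
            z = (m - best - xi) / s
            probs.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return probs
```

For instance, an option with prior mean 0 and variance 1 that yields a reward of 10 under unit observation noise moves to mean 5 with variance 0.5, and the probability-of-improvement rule then ranks options by how much of their posterior mass lies above the current best estimate.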