Department of Psychology and Center for Neural Science, New York University, New York, New York 10003, USA.
J Neurosci. 2011 Apr 6;31(14):5504-11. doi: 10.1523/JNEUROSCI.6316-10.2011.
Influential reinforcement learning theories propose that prediction error signals in the brain's nigrostriatal system guide learning for trial-and-error decision-making. However, since different decision variables can be learned from quantitatively similar error signals, a critical question is: what is the content of the decision representations trained by these error signals? We used fMRI to monitor neural activity in a two-armed bandit counterfactual decision task that gave human subjects information about both forgone and obtained monetary outcomes, so as to dissociate teaching signals that update expected values for each action from signals that train relative preferences between actions (a policy). The reward probabilities of the two choices varied independently of each other. This design allowed us to test whether subjects' choice behavior was guided by policy-based methods, which directly map states to advantageous actions, or by value-based methods such as Q-learning, in which choice policies are instead generated by learning an intermediate representation (reward expectancy). Behaviorally, we found that participants' choices were significantly influenced by both obtained and forgone rewards from the previous trial. We also found that subjects' blood oxygen level-dependent responses in striatum were modulated in opposite directions by the experienced and forgone rewards, but not by reward expectancy. This neural pattern, as well as subjects' choice behavior, is consistent with a teaching signal for developing habits or relative action preferences, rather than prediction errors for updating separate action values.
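The contrast between the two learning schemes can be illustrated with a minimal sketch. This is not the model fitted in the study. The function names, the learning rate `alpha`, the inverse temperature `beta`, and the exact update forms are assumptions chosen only to show the qualitative difference: a value-based update scales with the gap between outcome and current expectancy, whereas a relative-preference update depends only on obtained minus forgone reward.

```python
import math


def choice_prob(a, b, beta=1.0):
    """Softmax probability of picking option 0, given two scalar scores.

    The scores may be action values or preference terms; beta is a
    hypothetical inverse-temperature parameter.
    """
    return 1.0 / (1.0 + math.exp(-beta * (a - b)))


def q_update(q, chosen, obtained, forgone, alpha=0.2):
    """Value-based sketch: each action value moves toward its own outcome.

    The size of each step is a prediction error, so it depends on the
    current reward expectancy q[action].
    """
    unchosen = 1 - chosen
    new_q = list(q)
    new_q[chosen] += alpha * (obtained - q[chosen])
    new_q[unchosen] += alpha * (forgone - q[unchosen])
    return new_q


def preference_update(pref, chosen, obtained, forgone, alpha=0.2):
    """Policy-based sketch: one relative preference for action 0 over 1.

    Obtained and forgone rewards push the preference in opposite
    directions, and no reward-expectancy term enters the update.
    """
    sign = 1.0 if chosen == 0 else -1.0
    return pref + alpha * sign * (obtained - forgone)


# Example trial: action 0 chosen, it paid off (1), the forgone option did not (0).
q = q_update([0.5, 0.5], chosen=0, obtained=1.0, forgone=0.0)
pref = preference_update(0.0, chosen=0, obtained=1.0, forgone=0.0)
p_value_based = choice_prob(q[0], q[1])
p_policy_based = choice_prob(pref, 0.0)
```

In this sketch, changing the starting values passed to `q_update` changes the size of the update, while `preference_update` makes the same step regardless of any expectancy. That is the kind of signature the abstract describes when it reports striatal responses modulated by experienced and forgone rewards but not by reward expectancy.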