Colas Jaron T, Pauli Wolfgang M, Larsen Tobias, Tyszka J Michael, O'Doherty John P
Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA, United States of America.
Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America.
PLoS Comput Biol. 2017 Oct 19;13(10):e1005810. doi: 10.1371/journal.pcbi.1005810. eCollection 2017 Oct.
Prediction-error signals consistent with formal models of "reinforcement learning" (RL) have repeatedly been found within dopaminergic nuclei of the midbrain and dopaminoceptive areas of the striatum. However, the precise form of the RL algorithms implemented in the human brain is not yet well determined. Here, we created a novel paradigm optimized to dissociate the subtypes of reward-prediction errors that function as the key computational signatures of two distinct classes of RL models, namely "actor/critic" models and action-value-learning models (e.g., the Q-learning model). The state-value-prediction error (SVPE), which is independent of actions, is a hallmark of the actor/critic architecture, whereas the action-value-prediction error (AVPE) is the distinguishing feature of action-value-learning algorithms. To test for the presence of these prediction-error signals in the brain, we scanned human participants with a high-resolution functional magnetic-resonance imaging (fMRI) protocol optimized to enable measurement of neural activity in the dopaminergic midbrain as well as the striatal areas to which it projects. In keeping with the actor/critic model, the SVPE signal was detected in the substantia nigra. The SVPE was also clearly present in both the ventral striatum and the dorsal striatum. However, alongside these purely state-value-based computations, we also found evidence for AVPE signals throughout the striatum. These high-resolution fMRI findings suggest that model-free aspects of reward learning in humans can be explained algorithmically with RL in terms of an actor/critic mechanism operating in parallel with a system for more direct action-value learning.
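
To make the SVPE/AVPE distinction concrete, the following is a minimal illustrative sketch in Python of textbook tabular actor/critic and Q-learning updates, not the exact model fitted in the paper: the learning rate alpha, discount factor gamma, toy state and action counts, and the example transition are assumptions added for illustration only. The key point is that the SVPE depends only on the values of the current and next state, independent of the action taken, whereas the AVPE is tied to the value of the specific chosen action.

    # Illustrative sketch only (assumed parameters), not the authors' fitted model.
    N_STATES, N_ACTIONS = 3, 2
    alpha, gamma = 0.1, 0.95

    V = [0.0] * N_STATES                                          # critic's state values
    pref = [[0.0] * N_ACTIONS for _ in range(N_STATES)]           # actor's action preferences
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]              # action values (Q-learning)

    def actor_critic_update(s, a, r, s_next):
        """SVPE: error on the state's value, independent of which action was taken."""
        svpe = r + gamma * V[s_next] - V[s]
        V[s] += alpha * svpe                                      # critic update
        pref[s][a] += alpha * svpe                                # actor update reuses the same SVPE
        return svpe

    def q_learning_update(s, a, r, s_next):
        """AVPE: error on the value of the specific action taken in the state."""
        avpe = r + gamma * max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * avpe
        return avpe

    # One hypothetical transition: state 0, action 1, reward 1.0, next state 2.
    print("SVPE:", actor_critic_update(0, 1, 1.0, 2))
    print("AVPE:", q_learning_update(0, 1, 1.0, 2))

In this sketch the two errors coincide before learning has differentiated V and Q, but they diverge as soon as the action values within a state differ, which is the dissociation the paradigm was designed to exploit.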