Department of Psychology, Center for Neural Science, New York University, New York, New York 10003, USA.
J Neurosci. 2011 Apr 6;31(14):5526-39. doi: 10.1523/JNEUROSCI.4647-10.2011.
Although reinforcement learning (RL) theories have been influential in characterizing the mechanisms for reward-guided choice in the brain, the predominant temporal difference (TD) algorithm cannot explain many flexible or goal-directed actions that have been demonstrated behaviorally. We investigate such actions by contrasting an RL algorithm that is model based, in that it relies on learning a map or model of the task and planning within it, to traditional model-free TD learning. To distinguish these approaches in humans, we used functional magnetic resonance imaging in a continuous spatial navigation task, in which frequent changes to the layout of the maze forced subjects continually to relearn their favored routes, thereby exposing the RL mechanisms used. We sought evidence for the neural substrates of such mechanisms by comparing choice behavior and blood oxygen level-dependent (BOLD) signals to decision variables extracted from simulations of either algorithm. Both choices and value-related BOLD signals in striatum, although most often associated with TD learning, were better explained by the model-based theory. Furthermore, predecessor quantities for the model-based value computation were correlated with BOLD signals in the medial temporal lobe and frontal cortex. These results point to a significant extension of both the computational and anatomical substrates for RL in the brain.
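The algorithmic contrast at the heart of the study, a cached temporal-difference learner versus a learner that builds a map of the task and plans over it, can be made concrete in a few lines. The sketch below is purely illustrative and not the authors' implementation; the deterministic maze, the state and action counts, and the learning parameters are all assumptions introduced here.

```python
# Minimal sketch of the model-free vs. model-based contrast (assumptions only,
# not the paper's actual model or task parameters).
import numpy as np

n_states, n_actions = 16, 4   # assumed small deterministic maze
gamma, alpha = 0.9, 0.1       # assumed discount and learning rate

# --- Model-free TD(0): cache state values, adjust them only from experience ---
V_td = np.zeros(n_states)

def td_update(s, r, s_next):
    """One temporal-difference backup after observing (state, reward, next state)."""
    delta = r + gamma * V_td[s_next] - V_td[s]   # reward prediction error
    V_td[s] += alpha * delta

# --- Model-based RL: learn a transition/reward model, then plan within it ---
T = np.zeros((n_states, n_actions), dtype=int)   # learned successor of (s, a)
R = np.zeros(n_states)                           # learned reward at each state

def update_model(s, a, r, s_next):
    """Update the internal map of the task from one experienced transition."""
    T[s, a] = s_next
    R[s_next] = r

def plan(n_sweeps=50):
    """Value iteration over the learned model; values are recomputed from the
    map rather than cached from past outcomes."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            V[s] = max(R[T[s, a]] + gamma * V[T[s, a]] for a in range(n_actions))
    return V
```

The behavioral signature exploited by the task follows directly from this contrast: after a change to the maze layout, updating a single entry of the learned model and re-planning immediately revalues every route that passes through it, whereas the TD learner must re-experience those routes to correct its cached values.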