Usypchuk Alexandra A, Maes Etienne J P, Lozzi Megan, Avramidis Dimitrios K, Schoenbaum Geoffrey, Esber Guillem R, Gardner Matthew P H, Iordanova Mihaela D
Department of Psychology, Centre for Studies in Behavioural Neurobiology, Concordia University, Montreal, QC H4B 1R6, Canada.
NIDA Intramural Research Program, Baltimore, MD 21224, USA; Departments of Anatomy & Neurobiology and Psychiatry, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Solomon H. Snyder Department of Neuroscience, the Johns Hopkins University, Baltimore, MD 21287, USA.
Curr Biol. 2025 Aug 18;35(16):4019-4027.e7. doi: 10.1016/j.cub.2025.06.076. Epub 2025 Jul 29.
The discovery that midbrain dopamine (DA) transients can be mapped onto reward prediction errors (RPEs), the critical signal that drives learning, is a landmark in neuroscience. Causal support for the RPE hypothesis comes from studies showing that stimulating DA neurons can drive learning under conditions where it would not otherwise occur. However, such stimulation might also promote learning by adding reward value and indirectly inducing an RPE. This added value could support new learning even when it is insufficient to support instrumental behavior. Thus, these competing interpretations are challenging to disentangle and require direct comparison under matched conditions. We developed two computational models grounded in temporal difference reinforcement learning (TDRL) that dissociate the role of DA as an RPE versus a value signal. We validated our models by showing that they both predict learning (unblocking) when ventral tegmental area (VTA) DA stimulation occurs during expected reward delivery in a behavioral blocking design and confirmed this behaviorally. We then contrasted the models by delivering constant optogenetic stimulation during reward across both learning phases of blocking. The value model predicted blocking; the RPE model predicted unblocking. Behavioral results aligned with the latter. Moreover, the RPE model uniquely predicted that constant stimulation would unblock learning at higher frequencies (>20 Hz) when the artificial error alone drives learning. This, too, was confirmed experimentally. We demonstrate a principled computational and empirical dissociation between DA as an RPE versus a value signal. Our results advance understanding of how DA neuron stimulation drives learning.
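The blocking logic at the heart of the design can be sketched with a simple error-driven learning rule. The following is an illustrative Rescorla-Wagner-style simulation, not the authors' TDRL models: the cue names, learning rate, trial counts, and the injected error term `eps` (standing in for an artificial, stimulation-like boost to the prediction error) are all assumptions for demonstration.

```python
# Minimal error-driven learning sketch of blocking and unblocking.
# Illustrative only; not the paper's computational models.
def train(trials, cues, reward, w, alpha=0.3, extra_error=0.0):
    """Update cue weights by the prediction error at reward delivery.

    extra_error models an artificial boost injected into the error
    term (e.g., a stimulation-like signal); 0.0 means no injection.
    """
    for _ in range(trials):
        pred = sum(w[c] for c in cues)        # summed prediction over present cues
        delta = (reward - pred) + extra_error  # prediction error (+ injected error)
        for c in cues:
            w[c] += alpha * delta
    return w

# Blocking: Phase 1 trains A alone; in Phase 2 the compound AX earns the
# same reward, so the error is ~0 and X acquires almost no value.
w = {"A": 0.0, "X": 0.0}
train(50, ["A"], 1.0, w)           # Phase 1: A fully predicts reward
train(50, ["A", "X"], 1.0, w)      # Phase 2: AX -> same reward
blocked_wX = w["X"]                # stays near zero: X is blocked

# Unblocking: injecting a positive error at reward in Phase 2 only
# drives new learning to X despite the reward being predicted.
w2 = {"A": 0.0, "X": 0.0}
train(50, ["A"], 1.0, w2)
train(50, ["A", "X"], 1.0, w2, extra_error=0.5)
unblocked_wX = w2["X"]             # clearly above zero: X is unblocked
```

The contrast shown here is the intuition behind the abstract's validation experiment: when the injected error appears only at the expected reward in Phase 2, learning about the added cue proceeds even though the reward itself generates no error.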