Kato Ayaka, Morita Kenji
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY 10029-5674, United States.
Postdoctoral Fellowship for Research Abroad, Japan Society for the Promotion of Science, Tokyo 102-0083, Japan.
J Neurosci. 2025 Jul 18. doi: 10.1523/JNEUROSCI.0170-25.2025.
Dopamine has been suggested to encode the reward prediction error (RPE) of reinforcement learning (RL) theory, but it has also been shown to exhibit heterogeneous patterns depending on region and condition: some dopamine signals ramp up toward a predictable reward, whereas others respond only to the reward-predicting cue. It remains elusive how these heterogeneities relate to the various RL algorithms that animals and humans have been proposed to employ, such as RL under predictive state representations, hierarchical RL, and distributional RL. Here we demonstrate that these relationships can be coherently explained by incorporating decay of learned values (value-decay), which is implementable as decay of dopamine-dependent plastic changes in synaptic strengths. First, we show that value-decay causes ramping RPE under certain state representations but not under others. This accounted for the observed gradual fading of dopamine ramping over repeated reward navigation, attributed to the gradual formation of predictive state representations. It also explained the cue-type- and inter-trial-interval-dependent temporal patterns of dopamine. Next, we constructed a hierarchical RL model composed of two coupled systems, one with value-decay and one without. The model accounted for distinct patterns of neuronal activity in parallel striatal-dopamine circuits and their proposed roles in flexible learning and stable habit formation. Lastly, we examined two distinct algorithms of distributional RL, with and without value-decay. These algorithms explained how distinct dopamine patterns across striatal regions relate to the reported differences in the strength of distributional coding. These results suggest that within-striatum differences, specifically a medial-to-lateral gradient in value or synaptic decay, tune regional RL computations by generating distinct patterns of dopamine/RPE signals.

Significance Statement: Dopamine was long considered to universally represent the reward prediction error of simple reinforcement learning (RL). However, recent studies have revealed that dopamine in fact exhibits diverse patterns depending on region and condition. In parallel, it has been shown that animals' value learning cannot always be described by simple RL, but rather by more sophisticated algorithms, namely RL under particular state representations, hierarchical RL, and distributional RL. A major remaining question is how, mechanistically, the various patterns of dopamine are generated and how they implement these different RL computations across regions and conditions. We present a coherent answer to this question, in which the key is a regional difference, or gradient, in the degree of decay of dopamine-dependent plastic changes at the cortico-striatal synapses that store values.
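To make the core mechanism concrete, below is a minimal sketch of temporal-difference (TD) learning with value-decay on a linear track with a punctate (state-by-state) representation. This is an illustration of the general idea described in the abstract, not the paper's exact model: the task layout, the multiplicative per-step decay, and all parameter values are assumptions chosen for demonstration. It shows how, without decay, RPEs vanish once values converge, whereas with decay the RPE remains positive and ramps up toward the predictable reward.

```python
import numpy as np

def run_td(n_states=10, n_trials=500, alpha=0.5, gamma=0.97, decay=0.0, reward=1.0):
    """TD(0) learning on a linear track with optional value-decay.

    States 0..n_states-1 are visited in sequence on each trial; the reward is
    delivered on leaving the final state. `decay` is the fraction by which every
    learned value decays per step (0 = standard TD learning without forgetting).
    """
    V = np.zeros(n_states)
    rpe = np.zeros(n_states)  # RPE recorded at each state on the most recent trial
    for _ in range(n_trials):
        for s in range(n_states):
            v_next = V[s + 1] if s + 1 < n_states else 0.0
            r = reward if s == n_states - 1 else 0.0
            delta = r + gamma * v_next - V[s]   # TD reward prediction error
            V[s] += alpha * delta               # dopamine-dependent plasticity
            V *= (1.0 - decay)                  # value-decay (decay of plastic changes)
            rpe[s] = delta
    return V, rpe

# Without decay, RPEs shrink toward zero once the values converge; with decay,
# values are continually "forgotten", so a positive RPE reappears at every step
# and grows toward the predictable reward, i.e., a ramp.
_, rpe_no_decay = run_td(decay=0.0)
_, rpe_decay = run_td(decay=0.01)
print("RPE per state, no decay:", np.round(rpe_no_decay, 3))
print("RPE per state, decay   :", np.round(rpe_decay, 3))
```

Per the abstract, this ramp depends on the state representation: under a predictive state representation, which animals are proposed to form gradually over repeated navigation, the ramp fades. The sketch above only shows the decay side of that contrast under a punctate representation.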