Sousa Margarida, Bujalski Pawel, Cruz Bruno F, Louie Kenway, McNamee Daniel C, Paton Joseph J
Champalimaud Centre for the Unknown, Lisbon, Portugal.
Allen Institute for Neural Dynamics, Seattle, WA, USA.
Nature. 2025 Jun;642(8068):691-699. doi: 10.1038/s41586-025-09089-6. Epub 2025 Jun 4.
Midbrain dopamine neurons (DANs) signal reward-prediction errors that teach recipient circuits about expected rewards. However, DANs are thought to provide a substrate for temporal difference (TD) reinforcement learning (RL), an algorithm that learns the mean of temporally discounted expected future rewards, discarding useful information about experienced distributions of reward amounts and delays. Here we present time-magnitude RL (TMRL), a multidimensional variant of distributional RL that learns the joint distribution of future rewards over time and magnitude. We also uncover signatures of TMRL-like computations in the activity of optogenetically identified DANs in mice during behaviour. Specifically, we show that there is significant diversity in both temporal discounting and tuning for the reward magnitude across DANs. These features allow the computation of a two-dimensional, probabilistic map of future rewards from just 450 ms of the DAN population response to a reward-predictive cue. Furthermore, reward-time predictions derived from this code correlate with anticipatory behaviour, suggesting that similar information is used to guide decisions about when to act. Finally, by simulating behaviour in a foraging environment, we highlight the benefits of a joint probability distribution of reward over time and magnitude in the face of dynamic reward landscapes and internal states. These findings show that rich probabilistic reward information is learnt and communicated to DANs, and suggest a simple, local-in-time extension of TD algorithms that explains how such information might be acquired and computed.
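The abstract describes a population-level extension of TD learning in which units differ in temporal discounting and reward-magnitude tuning. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: a tabular task, the discount factors, the threshold-style magnitude tuning, and all parameter values are illustrative assumptions; only the combination of diverse discounts with diverse magnitude sensitivity, each trained by a standard local-in-time TD update, is taken from the abstract.

```python
# Hedged sketch of a TMRL-like population code (assumptions throughout; see lead-in).
import numpy as np

rng = np.random.default_rng(0)

# Toy task: a cue at t=0 predicts a reward after a random delay with a random magnitude.
N_STATES = 10            # states 0..9 tile time after the cue
DELAYS = [3, 7]          # possible reward delays in time steps (assumed)
MAGNITUDES = [1.0, 4.0]  # possible reward magnitudes (assumed)

# Population of units: each (discount factor, magnitude threshold) pair is one channel.
# Diverse discounts encode reward timing; thresholded responses are one simple
# stand-in for diverse magnitude tuning.
GAMMAS = np.array([0.5, 0.7, 0.85, 0.95])
THRESHOLDS = np.array([0.5, 2.0])   # a unit only "sees" rewards above its threshold
ALPHA = 0.1                         # learning rate

# Value table: V[discount index, threshold index, state]
V = np.zeros((len(GAMMAS), len(THRESHOLDS), N_STATES))

for episode in range(5000):
    delay = rng.choice(DELAYS)
    magnitude = rng.choice(MAGNITUDES)
    for t in range(N_STATES - 1):
        # reward is observed on the transition out of state t
        r = magnitude if t + 1 == delay else 0.0
        for gi, gamma in enumerate(GAMMAS):
            for hi, thr in enumerate(THRESHOLDS):
                r_tuned = r if r > thr else 0.0                  # magnitude tuning
                td = r_tuned + gamma * V[gi, hi, t + 1] - V[gi, hi, t]
                V[gi, hi, t] += ALPHA * td                       # local-in-time TD update

# Cue-evoked values across channels jointly reflect reward timing and magnitude.
print(np.round(V[:, :, 0], 3))
```

In this sketch, the cue-state values across the different discount factors form a Laplace-transform-like code of the reward-delay distribution within each magnitude channel, which is one way a two-dimensional map of reward over time and magnitude could in principle be read out from a brief population response, in the spirit of what the abstract reports.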