Multi-timescale reinforcement learning in the brain.
Author information
Masset Paul, Tano Pablo, Kim HyungGoo R, Malik Athar N, Pouget Alexandre, Uchida Naoshige
Affiliations
Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA.
Center for Brain Science, Harvard University, Cambridge, MA, USA.
Publication information
Nature. 2025 Jun 4. doi: 10.1038/s41586-025-08929-9.
To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behaviour can be learned through reinforcement learning, a class of algorithms that has been successful at training artificial agents and at characterizing the firing of dopaminergic neurons in the midbrain. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, known as the discount factor. Here we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement learning agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopaminergic neurons in mice performing two behavioural tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopaminergic neurons and a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations, and open new avenues for the design of more-efficient reinforcement learning algorithms.
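The abstract contrasts classical single-timescale exponential discounting with a population of learners spanning many discount factors. The following is a minimal sketch of that idea in Python, not the authors' implementation: the toy cue-reward delay task and all parameter values (T, gammas, alpha) are illustrative assumptions. It runs tabular TD(0) learners in parallel, one per discount factor, each generating its own reward prediction error, mirroring the picture of individual dopaminergic neurons carrying RPEs with cell-specific discount time constants.

```python
import numpy as np

# Minimal sketch (illustrative, not the authors' code): parallel tabular
# TD(0) learners, one per discount factor. Each learner computes its own
# reward prediction error (RPE) with a cell-specific gamma.
# Hypothetical task: a cue at state 0 predicts a unit reward T steps later.

T = 10                                           # cue-to-reward delay (steps)
gammas = np.array([0.5, 0.7, 0.9, 0.95, 0.99])   # diversity of timescales
alpha = 0.1                                      # learning rate
V = np.zeros((len(gammas), T + 1))               # value tables; state T is terminal

for episode in range(3000):
    for s in range(T):
        r = 1.0 if s == T - 1 else 0.0           # reward on entering state T
        delta = r + gammas * V[:, s + 1] - V[:, s]   # one RPE per gamma
        V[:, s] += alpha * delta                 # TD(0) update, all timescales at once

# Each timescale discounts the delayed reward exponentially:
#   V_gamma(cue) -> gamma ** (T - 1)
# but their average is a mixture of exponentials, i.e. an effective
# discount curve that is no longer a single exponential (hyperbolic-like).
print(np.round(V[:, 0], 3))      # per-gamma cue values, approx. gamma**(T-1)
print(round(V[:, 0].mean(), 3))  # mixture: non-exponential effective discount
```

Because each learned cue value approaches gamma**(T-1), the vector of values across gammas behaves like a discretized Laplace transform of the future reward; reading out such a population is one route to the computational benefits of multi-timescale codes alluded to in the abstract, such as recovering when a reward is expected rather than only its discounted magnitude.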