Masset Paul, Tano Pablo, Kim HyungGoo R, Malik Athar N, Pouget Alexandre, Uchida Naoshige
Department of Molecular and Cellular Biology, Harvard University, USA.
Center for Brain Science, Harvard University, USA.
bioRxiv. 2023 Nov 14:2023.11.12.566754. doi: 10.1101/2023.11.12.566754.
To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behavior can be learned through reinforcement learning, a class of algorithms that has been successful at training artificial agents and at characterizing the firing of dopamine neurons in the midbrain. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, controlled by the discount factor. Here, we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement learning agents that learn at a multitude of timescales possess distinct computational benefits. Next, we report that dopamine neurons in mice performing two behavioral tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopamine neurons and a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations, and they open new avenues for the design of more efficient reinforcement learning algorithms.
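The link between multiple timescales and non-exponential discounting can be illustrated with a small numerical sketch. It is not the authors' model: the population of discount rates below (a fine grid of rates `k`, weighted by `exp(-k)`) is a hypothetical choice, made only because the average of `exp(-k * t)` under that weighting has the closed form `1 / (1 + t)`, a hyperbolic discount. All function and variable names are illustrative.

```python
import math

def exponential_discount(gamma, t):
    """Classical single-timescale discount: a reward t steps away is worth gamma ** t."""
    return gamma ** t

def multi_timescale_discount(rates, weights, t):
    """Weighted average of exponential discounts exp(-k * t) over discount rates k."""
    total = sum(weights)
    return sum(w * math.exp(-k * t) for k, w in zip(rates, weights)) / total

# Hypothetical population (an assumption, not taken from the paper):
# discount rates on a midpoint grid over [0, 40], weighted by exp(-k).
n = 20000
k_max = 40.0
dk = k_max / n
rates = [(i + 0.5) * dk for i in range(n)]
weights = [math.exp(-k) for k in rates]

for t in [0, 1, 2, 5, 10]:
    mixed = multi_timescale_discount(rates, weights, t)
    hyperbolic = 1.0 / (1.0 + t)
    single = exponential_discount(0.8, t)
    print(f"t={t:2d}  mixture={mixed:.4f}  1/(1+t)={hyperbolic:.4f}  0.8**t={single:.4f}")
```

Each unit in the population discounts exponentially, yet the population average does not: a single exponential loses a constant fraction of value per step, whereas the mixture loses less and less per step as the delay grows. Other distributions of discount rates would give other non-exponential shapes.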