DeepMind, London, UK.
Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, UK.
Nature. 2020 Jan;577(7792):671-675. doi: 10.1038/s41586-019-1924-6. Epub 2020 Jan 15.
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain. According to the now-canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from the mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
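To make the distributional hypothesis concrete, the following is a minimal Python sketch of distributional temporal-difference learning through asymmetrically scaled prediction errors, the kind of mechanism the abstract describes. It is not the authors' implementation: the bimodal reward distribution, the asymmetry parameters in taus, the number of predictors, and the learning rate are all illustrative assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_reward():
        # Illustrative bimodal outcome: the mean (3.7) is a value that never
        # actually occurs, so a single scalar prediction loses most of the
        # structure of the distribution.
        return rng.choice([1.0, 10.0], p=[0.7, 0.3])

    # A population of value predictors, one per hypothetical "channel". Channel i
    # scales positive prediction errors by taus[i] and negative errors by
    # 1 - taus[i]; the fixed point of this rule is the taus[i]-th expectile of
    # the reward distribution, so the population as a whole encodes the
    # distribution rather than only its mean.
    taus = np.linspace(0.1, 0.9, 9)   # assumed asymmetry parameters
    values = np.zeros_like(taus)      # learned predictions, one per channel
    base_lr = 0.01

    for _ in range(50_000):
        r = sample_reward()
        delta = r - values            # per-channel prediction errors
        lr = np.where(delta > 0, taus, 1.0 - taus) * base_lr
        values += lr * delta          # asymmetric update

    print("learned predictions:", np.round(values, 2))  # spans low to high outcomes
    print("classical mean prediction:", 0.7 * 1.0 + 0.3 * 10.0)

Setting every entry of taus to 0.5 makes the update symmetric and all channels converge to the mean, recovering the classical single-scalar reward prediction error rule as a special case of this scheme.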