Department of Neuroscience, University of Minnesota, Minneapolis, Minnesota, United States of America.
PLoS One. 2009 Oct 20;4(10):e7362. doi: 10.1371/journal.pone.0007362.
Temporal-difference (TD) algorithms have been proposed as models of reinforcement learning (RL). We examine two issues of distributed representation in these TD algorithms: distributed representations of belief and distributed discounting factors. Distributed representation of belief allows the believed state of the world to distribute across sets of equivalent states. Distributed exponential discounting factors produce hyperbolic discounting in the behavior of the agent itself. We examine these issues in the context of a TD RL model in which state-belief is distributed over a set of exponentially-discounting "microAgents", each of which has a separate discounting factor (gamma). Each microAgent maintains an independent hypothesis about the state of the world, and a separate value estimate of taking actions within that hypothesized state. The overall agent thus instantiates a flexible representation of an evolving world-state. As with other TD models, the value-error (delta) signal within the model matches dopamine signals recorded from animals in standard conditioning reward paradigms. The distributed representation of belief provides an explanation for the decrease in dopamine at the conditioned stimulus seen in overtrained animals, for the differences between trace and delay conditioning, and for transient bursts of dopamine seen at movement initiation. Because each microAgent also includes its own exponential discounting factor, the overall agent shows hyperbolic discounting, consistent with behavioral experiments.
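For readers who want a concrete picture of the architecture described in the abstract, the following Python sketch is one possible, deliberately simplified rendering of it: a population of TD(0) microAgents, each with its own exponential discounting factor gamma and its own value table, whose value-errors are pooled to give the overall agent's delta. The class names, the uniform spread of gammas, and the simple averaging rule are illustrative assumptions, not the authors' implementation. The final lines check numerically that mixing exponential discount curves over a range of gammas yields an approximately hyperbolic discount curve of the form 1/(1+t).

    import numpy as np

    class MicroAgent:
        """One TD(0) learner with its own private exponential discounting factor."""
        def __init__(self, gamma, n_states, alpha=0.1):
            self.gamma = gamma                # this microAgent's discounting factor
            self.alpha = alpha                # learning rate
            self.value = np.zeros(n_states)   # value estimate over hypothesized states

        def td_update(self, state, next_state, reward):
            # Standard TD(0) value-error (delta), computed with this agent's own gamma.
            delta = reward + self.gamma * self.value[next_state] - self.value[state]
            self.value[state] += self.alpha * delta
            return delta

    class DistributedAgent:
        """Overall agent: the value-error is pooled across a set of microAgents."""
        def __init__(self, n_micro, n_states):
            # Spread gammas across (0, 1); each microAgent discounts exponentially,
            # but the population as a whole discounts approximately hyperbolically.
            self.micro = [MicroAgent(g, n_states)
                          for g in np.linspace(0.05, 0.99, n_micro)]

        def step(self, state, next_state, reward):
            # Pool the individual deltas (a simple mean here, an assumption for illustration).
            return np.mean([m.td_update(state, next_state, reward) for m in self.micro])

    # Numerical check that averaging exponential discount curves gives a hyperbolic one:
    # the mean of gamma**t over gammas uniform on (0, 1) approaches 1 / (1 + t).
    gammas = np.linspace(0.001, 0.999, 1000)
    for t in (1, 5, 10, 20):
        print(t, np.mean(gammas ** t), 1.0 / (1.0 + t))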