


Prospective and retrospective temporal difference learning.

Author

Peter Dayan

Affiliation

Gatsby Computational Neuroscience Unit, UCL, London, WC1N 3AR, UK.

Publication

Network. 2009;20(1):32-46. doi: 10.1080/09548980902759086.

DOI: 10.1080/09548980902759086
PMID: 19229732
Abstract

A striking recent finding is that monkeys behave maladaptively in a class of tasks in which they know that reward is going to be systematically delayed. This may be explained by a malign Pavlovian influence arising from states with low predicted values. However, by very carefully analyzing behavioral data from such tasks, La Camera and Richmond (2008) observed the additional important characteristic that subjects perform differently on states in the task that are at equal distances from the future reward, depending on what has happened in the recent past. The authors pointed out that this violates the definition of state value in the standard reinforcement learning models that are ubiquitous as accounts of operant and classical conditioned behavior; they suggested and analyzed an alternative temporal difference (TD) model in which past and future are melded. Here, we show that, in fact, a standard TD model can actually exhibit the same behavior, and that this avoids deleterious consequences for choice. At the heart of the model is the average reward per step, which acts as a baseline for measuring immediate rewards. Relatively subtle changes to this baseline occasioned by the past can markedly influence predictions and thus behavior.
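The mechanism the abstract points to — the average reward per step serving as a baseline against which immediate rewards are measured — is the core of the standard average-reward (differential) TD formulation. The following is a minimal sketch of that general formulation on a hypothetical chain of states with a delayed reward, not a reconstruction of the paper's actual model or tasks:

```python
# Average-reward TD(0) sketch: values are learned relative to a baseline
# rho, the estimated average reward per step. States 0..N-1 form a chain;
# a single unit of reward arrives only on completing the chain.
N = 4
V = [0.0] * N          # differential state values (relative to rho)
rho = 0.0              # estimated average reward per step (the baseline)
alpha, beta = 0.1, 0.01  # value and baseline learning rates

def step(s):
    """Deterministically advance one step; reward only on wrap-around."""
    s_next = (s + 1) % N
    r = 1.0 if s_next == 0 else 0.0
    return s_next, r

s = 0
for _ in range(20000):
    s_next, r = step(s)
    delta = r - rho + V[s_next] - V[s]  # TD error with rho as baseline
    V[s] += alpha * delta
    rho += beta * delta                 # baseline tracks reward per step
    s = s_next

# With one reward every N steps, rho approaches 1/N, and states nearer
# the reward acquire higher differential values.
print(rho, V)
```

Because values are measured relative to `rho`, anything in the recent past that shifts the baseline shifts every prediction with it — which is how subtle baseline changes can produce the state-dependent behavior the abstract describes.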


Similar Articles

1
Prospective and retrospective temporal difference learning.
Network. 2009;20(1):32-46. doi: 10.1080/09548980902759086.
2
Modeling the violation of reward maximization and invariance in reinforcement schedules.
PLoS Comput Biol. 2008 Aug 8;4(8):e1000131. doi: 10.1371/journal.pcbi.1000131.
3
Temporal and probabilistic discounting of rewards in children and adolescents: effects of age and ADHD symptoms.
Neuropsychologia. 2006;44(11):2092-103. doi: 10.1016/j.neuropsychologia.2005.10.012. Epub 2005 Nov 21.
4
Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics.
Neural Netw. 2006 Oct;19(8):1233-41. doi: 10.1016/j.neunet.2006.05.039. Epub 2006 Sep 18.
5
Reward-dependent learning in neuronal networks for planning and decision making.
Prog Brain Res. 2000;126:217-29. doi: 10.1016/S0079-6123(00)26016-0.
6
Theory meets pigeons: the influence of reward-magnitude on discrimination-learning.
Behav Brain Res. 2009 Mar 2;198(1):125-9. doi: 10.1016/j.bbr.2008.10.038. Epub 2008 Nov 8.
7
Dopamine-dependent reinforcement of motor skill learning: evidence from Gilles de la Tourette syndrome.
Brain. 2011 Aug;134(Pt 8):2287-301. doi: 10.1093/brain/awr147. Epub 2011 Jul 3.
8
Learning the opportunity cost of time in a patch-foraging task.
Cogn Affect Behav Neurosci. 2015 Dec;15(4):837-53. doi: 10.3758/s13415-015-0350-y.
9
Adaptive learning via selectionism and Bayesianism, Part II: the sequential case.
Neural Netw. 2009 Apr;22(3):229-36. doi: 10.1016/j.neunet.2009.03.017. Epub 2009 Apr 5.
10
Statistical mechanics of reward-modulated learning in decision-making networks.
Neural Comput. 2012 May;24(5):1230-70. doi: 10.1162/NECO_a_00264. Epub 2012 Feb 1.

Cited By

1
Global reward state affects learning and activity in raphe nucleus and anterior insula in monkeys.
Nat Commun. 2020 Jul 28;11(1):3771. doi: 10.1038/s41467-020-17343-w.
2
Predictive decision making driven by multiple time-linked reward representations in the anterior cingulate cortex.
Nat Commun. 2016 Aug 1;7:12327. doi: 10.1038/ncomms12327.
3
A Computational Analysis of Aberrant Delay Discounting in Psychiatric Disorders.
Front Psychol. 2016 Jan 13;6:1948. doi: 10.3389/fpsyg.2015.01948. eCollection 2015.
4
Anticipation and choice heuristics in the dynamic consumption of pain relief.
PLoS Comput Biol. 2015 Mar 20;11(3):e1004030. doi: 10.1371/journal.pcbi.1004030. eCollection 2015 Mar.
5
Neuromodulation of reward-based learning and decision making in human aging.
Ann N Y Acad Sci. 2011 Oct;1235:1-17. doi: 10.1111/j.1749-6632.2011.06230.x.
6
An imperfect dopaminergic error signal can drive temporal-difference learning.
PLoS Comput Biol. 2011 May;7(5):e1001133. doi: 10.1371/journal.pcbi.1001133. Epub 2011 May 12.
7
Pavlovian-instrumental interaction in 'observing behavior'.
PLoS Comput Biol. 2010 Sep 9;6(9):e1000903. doi: 10.1371/journal.pcbi.1000903.