
One-shot learning and behavioral eligibility traces in sequential decision making.

Affiliations

Brain-Mind-Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

Publication information

Elife. 2019 Nov 11;8:e47463. doi: 10.7554/eLife.47463.

DOI: 10.7554/eLife.47463
PMID: 31709980
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6897511/
Abstract

In many daily tasks, we make multiple decisions before reaching a goal. In order to learn such sequences of decisions, a mechanism to link earlier actions to later reward is necessary. Reinforcement learning (RL) theory suggests two classes of algorithms solving this credit assignment problem: In classic temporal-difference learning, earlier actions receive reward information only after multiple repetitions of the task, whereas models with eligibility traces reinforce entire sequences of actions from a single experience (one-shot). Here, we show one-shot learning of sequences. We developed a novel paradigm to observe which actions and states along a multi-step sequence are reinforced after a single reward. By focusing our analysis on those states for which RL with and without eligibility trace make qualitatively distinct predictions, we find direct behavioral (choice probability) and physiological (pupil dilation) signatures of reinforcement learning with eligibility trace across multiple sensory modalities.
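
The distinction the abstract draws between classic temporal-difference learning and learning with an eligibility trace can be made concrete with a small, generic sketch. The Python snippet below is illustrative only and is not the authors' paradigm or analysis code; the chain environment, the always-advance policy, and the parameter names (alpha, gamma, lambda_) are assumptions chosen for the example, and the update rule is textbook tabular SARSA(lambda). With lambda_ = 0 (one-step TD), a single rewarded episode changes only the value of the final state-action pair; with lambda_ > 0, the same single reward also reinforces the earlier actions in the sequence, which is the qualitative "one-shot" pattern the study looks for.

```python
import numpy as np

# Minimal tabular SARSA(lambda) sketch on a deterministic 5-step chain.
# Illustration of the two algorithm classes contrasted in the abstract,
# not the authors' task or code; environment, policy, and parameters are
# assumptions made for this example.

def sarsa_lambda_episode(n_states=5, alpha=0.5, gamma=0.9, lambda_=0.9):
    Q = np.zeros((n_states + 1, 2))   # action values; state n_states is the goal
    e = np.zeros_like(Q)              # eligibility trace per (state, action) pair
    state, action = 0, 1              # the agent always picks action 1 ("advance")
    while state < n_states:
        next_state = state + 1
        reward = 1.0 if next_state == n_states else 0.0   # single reward at the goal
        next_action = 1
        td_error = reward + gamma * Q[next_state, next_action] - Q[state, action]
        e[state, action] += 1.0       # mark the current pair as eligible
        Q += alpha * td_error * e     # update every pair in proportion to its trace
        e *= gamma * lambda_          # traces decay for pairs visited earlier
        state, action = next_state, next_action
    return Q

# After one rewarded episode, lambda_=0 (classic one-step TD) has credited only
# the final action, whereas lambda_=0.9 has credited the whole action sequence.
print(np.round(sarsa_lambda_episode(lambda_=0.0)[:-1, 1], 3))  # only the last entry is nonzero
print(np.round(sarsa_lambda_episode(lambda_=0.9)[:-1, 1], 3))  # graded credit back to the first action
```

The paper's behavioral and pupillometric analyses focus on the states for which these two update patterns make qualitatively different predictions after a single reward.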

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/88e6812bd194/elife-47463-fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/a49f9b0d3228/elife-47463-fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/152859a9215a/elife-47463-fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/800be9c29ef4/elife-47463-fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/cfd0bc30ba2e/elife-47463-fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/445b03379054/elife-47463-fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/9739c0d6a19c/elife-47463-fig7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/5f83c420f6fd/elife-47463-fig8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/0b91e66e2beb/elife-47463-fig9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/ce01ed01b4c5/elife-47463-fig10.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f3fc/6897511/0bf442e6ce0c/elife-47463-fig11.jpg

Similar articles

1. One-shot learning and behavioral eligibility traces in sequential decision making.
   Elife. 2019 Nov 11;8:e47463. doi: 10.7554/eLife.47463.
2. How pupil responses track value-based decision-making during and after reinforcement learning.
   PLoS Comput Biol. 2018 Nov 30;14(11):e1006632. doi: 10.1371/journal.pcbi.1006632. eCollection 2018 Nov.
3. Multiple memory systems as substrates for multiple decision systems.
   Neurobiol Learn Mem. 2015 Jan;117:4-13. doi: 10.1016/j.nlm.2014.04.014. Epub 2014 May 15.
4. Spatio-temporal credit assignment in neuronal population learning.
   PLoS Comput Biol. 2011 Jun;7(6):e1002092. doi: 10.1371/journal.pcbi.1002092. Epub 2011 Jun 30.
5. Credit Assignment in a Motor Decision Making Task Is Influenced by Agency and Not Sensory Prediction Errors.
   J Neurosci. 2018 May 9;38(19):4521-4530. doi: 10.1523/JNEUROSCI.3601-17.2018. Epub 2018 Apr 12.
6. Learning from delayed feedback: neural responses in temporal credit assignment.
   Cogn Affect Behav Neurosci. 2011 Jun;11(2):131-43. doi: 10.3758/s13415-011-0027-0.
7. Computational noise in reward-guided learning drives behavioral variability in volatile environments.
   Nat Neurosci. 2019 Dec;22(12):2066-2077. doi: 10.1038/s41593-019-0518-9. Epub 2019 Oct 28.
8. Navigating complex decision spaces: Problems and paradigms in sequential choice.
   Psychol Bull. 2014 Mar;140(2):466-86. doi: 10.1037/a0033455. Epub 2013 Jul 8.
9. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making.
   J Neurosci. 2007 Nov 21;27(47):12860-7. doi: 10.1523/JNEUROSCI.2496-07.2007.
10. Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making.
    PLoS Comput Biol. 2021 Jun 3;17(6):e1009070. doi: 10.1371/journal.pcbi.1009070. eCollection 2021 Jun.

Cited by

1. A State-Transition-Free Delayed-Feedback Task Elicits Heterogeneous Human Responses.
   J Cogn. 2025 Jul 14;8(1):39. doi: 10.5334/joc.453. eCollection 2025.
2. Exploring the steps of learning: computational modeling of initiatory-actions among individuals with attention-deficit/hyperactivity disorder.
   Transl Psychiatry. 2024 Jan 8;14(1):10. doi: 10.1038/s41398-023-02717-7.
3. State-transition-free reinforcement learning in chimpanzees (Pan troglodytes).
   Learn Behav. 2023 Dec;51(4):413-427. doi: 10.3758/s13420-023-00591-3. Epub 2023 Jun 27.
4. A reinforcement-based mechanism for discontinuous learning.
   Proc Natl Acad Sci U S A. 2022 Dec 6;119(49):e2215352119. doi: 10.1073/pnas.2215352119. Epub 2022 Nov 28.
5. A behavioural correlate of the synaptic eligibility trace in the nucleus accumbens.
   Sci Rep. 2022 Feb 4;12(1):1921. doi: 10.1038/s41598-022-05637-6.
6. Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making.
   PLoS Comput Biol. 2021 Jun 3;17(6):e1009070. doi: 10.1371/journal.pcbi.1009070. eCollection 2021 Jun.

References

1. Eligibility Traces and Plasticity on Behavioral Time Scales: Experimental Support of NeoHebbian Three-Factor Learning Rules.
   Front Neural Circuits. 2018 Jul 31;12:53. doi: 10.3389/fncir.2018.00053. eCollection 2018.
2. What does dopamine mean?
   Nat Neurosci. 2018 Jun;21(6):787-793. doi: 10.1038/s41593-018-0152-y. Epub 2018 May 14.
3. Pupil size reflects successful encoding and recall of memory in humans.
   Sci Rep. 2018 Mar 21;8(1):4949. doi: 10.1038/s41598-018-23197-6.
4. Dissociable effects of surprising rewards on learning and memory.
   J Exp Psychol Learn Mem Cogn. 2018 Sep;44(9):1430-1443. doi: 10.1037/xlm0000518. Epub 2018 Mar 19.
5. Safe and sensible preprocessing and baseline correction of pupil-size data.
   Behav Res Methods. 2018 Feb;50(1):94-106. doi: 10.3758/s13428-017-1007-2.
6. Behavioral time scale synaptic plasticity underlies CA1 place fields.
   Science. 2017 Sep 8;357(6355):1033-1036. doi: 10.1126/science.aan3846.
7. Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo.
   Nat Commun. 2017 Aug 24;8(1):334. doi: 10.1038/s41467-017-00394-x.
8. Sequential neuromodulation of Hebbian plasticity offers mechanism for effective reward-based navigation.
   Elife. 2017 Jul 10;6:e27756. doi: 10.7554/eLife.27756.
9. Does prediction error drive one-shot declarative learning?
   J Mem Lang. 2017 Jun;94:149-165. doi: 10.1016/j.jml.2016.11.001.
10. What to Choose Next? A Paradigm for Testing Human Sequential Decision Making.
    Front Psychol. 2017 Mar 7;8:312. doi: 10.3389/fpsyg.2017.00312. eCollection 2017.