Huertas Marco A, Schwettmann Sarah E, Shouval Harel Z
Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX, USA.
Department of Computational and Applied Mathematics, Rice University, Houston, TX, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
Front Synaptic Neurosci. 2016 Dec 15;8:37. doi: 10.3389/fnsyn.2016.00037. eCollection 2016.
The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those encountered by artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for expressing the LTP and LTD traces? Here we expand on our previous model to include several neuromodulators, illustrate through various examples how these differently contribute to learning reward timing within a wide set of training paradigms, and propose further roles that multiple neuromodulators can play in encoding additional information about the rewarding signal.
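To make the two-trace idea concrete, the sketch below illustrates one way such a mechanism could operate; it is not the authors' exact equations. A Hebbian coincidence during the stimulus drives two eligibility traces, one for LTP and one for LTD, each with its own (assumed) time constant and amplitude, and a delayed neuromodulatory reward signal converts the trace difference into a weight change. All parameter values, the pulse-shaped reward, and the function name `weight_change` are illustrative assumptions.

```python
import numpy as np

dt = 1.0                         # time step (ms)
tau_ltp, tau_ltd = 200.0, 600.0  # assumed trace time constants (ms)
a_ltp, a_ltd = 1.0, 0.4          # assumed effective trace magnitudes
eta = 0.001                      # learning rate (arbitrary)

def weight_change(reward_time, t_max=2000.0, stim_dur=100.0):
    """Integrate both eligibility traces and return the reward-gated weight update."""
    t = np.arange(0.0, t_max, dt)
    hebb = (t < stim_dur).astype(float)                       # pre*post coincidence during stimulus
    reward = (np.abs(t - reward_time) < 10.0).astype(float)   # brief neuromodulator pulse
    T_ltp, T_ltd, dw = 0.0, 0.0, 0.0
    for i in range(len(t)):
        # Each trace is driven by the same coincidence signal but decays
        # with its own time constant and has its own amplitude.
        T_ltp += dt * (-T_ltp / tau_ltp + a_ltp * hebb[i])
        T_ltd += dt * (-T_ltd / tau_ltd + a_ltd * hebb[i])
        # The neuromodulator reads out the traces: net potentiation if the
        # LTP trace dominates at reward time, net depression otherwise.
        dw += eta * reward[i] * (T_ltp - T_ltd) * dt
    return dw

print("early reward (250 ms): ", round(weight_change(250.0), 3))   # LTP trace dominates -> dw > 0
print("late reward (1000 ms): ", round(weight_change(1000.0), 3))  # LTD trace dominates -> dw < 0
```

With these illustrative parameters, the shorter-lived but larger LTP trace outweighs the longer-lived LTD trace only when the reward arrives soon after the stimulus, so the sign and size of the weight change depend on the stimulus-reward interval; this is the sense in which the trace pair can encode reward timing.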