Huertas Marco A, Schwettmann Sarah E, Shouval Harel Z
Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX, USA.
Department of Computational and Applied Mathematics, Rice University, Houston, TX, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
Front Synaptic Neurosci. 2016 Dec 15;8:37. doi: 10.3389/fnsyn.2016.00037. eCollection 2016.
The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different from those encountered by artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for expressing the LTP and LTD traces? Here we expand on our previous model to include several neuromodulators, illustrate through various examples how these differently contribute to learning reward timing within a wide set of training paradigms, and propose further roles that multiple neuromodulators can play in encoding additional information about the rewarding signal.
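To make the two-trace idea concrete, the sketch below illustrates one way such a mechanism could operate; it is not the authors' exact equations. A Hebbian coincidence during the stimulus drives two eligibility traces, one for LTP and one for LTD, each with its own (assumed) time constant and amplitude, and a delayed neuromodulatory reward signal converts the trace difference into a weight change. All parameter values, the pulse-shaped reward, and the function name `weight_change` are illustrative assumptions.

```python
import numpy as np

dt = 1.0                         # time step (ms)
tau_ltp, tau_ltd = 200.0, 600.0  # assumed trace time constants (ms)
a_ltp, a_ltd = 1.0, 0.4          # assumed effective trace magnitudes
eta = 0.001                      # learning rate (arbitrary)

def weight_change(reward_time, t_max=2000.0, stim_dur=100.0):
    """Integrate both eligibility traces and return the reward-gated weight update."""
    t = np.arange(0.0, t_max, dt)
    hebb = (t < stim_dur).astype(float)                       # pre*post coincidence during stimulus
    reward = (np.abs(t - reward_time) < 10.0).astype(float)   # brief neuromodulator pulse
    T_ltp, T_ltd, dw = 0.0, 0.0, 0.0
    for i in range(len(t)):
        # Each trace is driven by the same coincidence signal but decays
        # with its own time constant and has its own amplitude.
        T_ltp += dt * (-T_ltp / tau_ltp + a_ltp * hebb[i])
        T_ltd += dt * (-T_ltd / tau_ltd + a_ltd * hebb[i])
        # The neuromodulator reads out the traces: net potentiation if the
        # LTP trace dominates at reward time, net depression otherwise.
        dw += eta * reward[i] * (T_ltp - T_ltd) * dt
    return dw

print("early reward (250 ms): ", round(weight_change(250.0), 3))   # LTP trace dominates -> dw > 0
print("late reward (1000 ms): ", round(weight_change(1000.0), 3))  # LTD trace dominates -> dw < 0
```

With these illustrative parameters, the shorter-lived but larger LTP trace outweighs the longer-lived LTD trace only when the reward arrives soon after the stimulus, so the sign and size of the weight change depend on the stimulus-reward interval; this is the sense in which the trace pair can encode reward timing.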