Neurobiology Research Unit, Okinawa Institute of Science and Technology, 1919-1, Tancha, Onna-Son, Kunigami, Okinawa 904-0412, Japan.
Eur J Neurosci. 2012 Apr;35(7):1115-23. doi: 10.1111/j.1460-9568.2012.08055.x.
In the past few decades there has been remarkable convergence of machine learning with neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched and individual striatofugal neurons innervate both globus pallidus externa and globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiological recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is very conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner. We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity and further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.
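For readers unfamiliar with the reward prediction error referred to in the abstract, the following is a minimal sketch of the standard tabular TD(0) rule, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), applied to a single repeated trial with a reward at the final time step. It is a textbook illustration and not a model taken from the paper; the function names (`td_errors`, `train`), the five-step trial, and the values of the learning rate `alpha` and discount `gamma` are all assumptions chosen for illustration.

```python
def td_errors(values, rewards, gamma=1.0):
    """Return delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one trial."""
    n = len(rewards)
    deltas = []
    for t in range(n):
        # The state after the last time step is terminal, with value 0.
        v_next = values[t + 1] if t + 1 < n else 0.0
        deltas.append(rewards[t] + gamma * v_next - values[t])
    return deltas


def train(n_steps=5, n_trials=200, alpha=0.1, gamma=1.0):
    """Run tabular TD(0) on a trial with a unit reward at the last step.

    All parameter values are illustrative assumptions.
    Returns the prediction errors before learning, after learning,
    and the learned state values.
    """
    rewards = [0.0] * (n_steps - 1) + [1.0]
    values = [0.0] * n_steps
    first = td_errors(values, rewards, gamma)  # errors on a naive trial

    for _ in range(n_trials):
        for t in range(n_steps):
            v_next = values[t + 1] if t + 1 < n_steps else 0.0
            delta = rewards[t] + gamma * v_next - values[t]
            values[t] += alpha * delta  # move V(s_t) toward its TD target

    last = td_errors(values, rewards, gamma)  # errors once reward is predicted
    return first, last, values


if __name__ == "__main__":
    first, last, values = train()
    print("errors before learning:", first)
    print("errors after learning: ", [round(d, 4) for d in last])
    print("learned values:        ", [round(v, 4) for v in values])
```

Before learning, the only non-zero error occurs at the time of reward. After learning, the reward is predicted by the preceding states and the error at reward time falls toward zero. The sketch does not include a pre-cue state, so it does not show the accompanying transfer of the error to cue onset.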