An implementation of reinforcement learning based on spike timing dependent plasticity.

Author Information

Roberts Patrick D, Santiago Roberto A, Lafferriere Gerardo

Affiliations

Department of Science and Engineering, Oregon Health and Science University, Portland, OR 97239, USA.

Publication Information

Biol Cybern. 2008 Dec;99(6):517-23. doi: 10.1007/s00422-008-0265-6. Epub 2008 Oct 22.

Abstract

An explanatory model is developed to show how synaptic learning mechanisms modeled through spike-timing dependent plasticity (STDP) can result in long-term adaptations consistent with reinforcement learning models. In particular, the reinforcement learning model known as temporal difference (TD) learning has been used to model neuronal behavior in the orbitofrontal cortex (OFC) and ventral tegmental area (VTA) of macaque monkeys during reinforcement learning. While some research has empirically observed a connection between STDP and TD, there has not been an explanatory model directly connecting TD to STDP. Through analysis of the learning dynamics that result from a general form of an STDP learning rule, the connection between STDP and TD is explained. We further demonstrate that an STDP learning rule drives the spike probability of a reward-predicting neuronal population to a stable equilibrium. The equilibrium solution has an increasing slope, and the steepness of the slope predicts the probability of the reward, similar to the electrophysiological recordings of Montague and Berns [Neuron 36(2):265-284, 2002], which suggest a slope that predicts the value of the anticipated reward. This connection begins to shed light on more recent data gathered from the VTA and OFC that are not well modeled by TD. We suggest that STDP provides the underlying mechanism for explaining reinforcement learning and other higher-level perceptual and cognitive functions.
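
Since the abstract turns on the relationship between a generic STDP window and TD learning, a minimal sketch may help fix ideas. The Python snippet below pairs a standard pair-based exponential STDP window with a tabular TD(0) update; the amplitudes, time constants, learning rate, discount factor, and the three-state reward chain are all illustrative assumptions, not the model analyzed in the paper.

```python
import numpy as np

# Illustrative parameters for a pair-based exponential STDP window; these
# are common textbook choices, not the parameters of Roberts et al. (2008).
A_PLUS = 0.010    # potentiation amplitude (pre fires before post)
A_MINUS = 0.012   # depression amplitude (post fires before pre)
TAU_PLUS = 20.0   # potentiation time constant (ms)
TAU_MINUS = 20.0  # depression time constant (ms)

def stdp_weight_change(dt_ms):
    """Synaptic weight change for one pre/post spike pair.

    dt_ms = t_post - t_pre. Positive intervals (pre leads post) potentiate,
    negative intervals depress, each decaying exponentially with |dt|.
    """
    if dt_ms >= 0.0:
        return A_PLUS * np.exp(-dt_ms / TAU_PLUS)
    return -A_MINUS * np.exp(dt_ms / TAU_MINUS)

def td0_update(values, s, s_next, reward, alpha=0.1, gamma=0.95):
    """One tabular TD(0) step: V(s) += alpha * (r + gamma * V(s') - V(s)).

    Returns the TD error, the quantity the paper relates to the net drift
    produced by the STDP window.
    """
    td_error = reward + gamma * values[s_next] - values[s]
    values[s] += alpha * td_error
    return td_error

if __name__ == "__main__":
    # The window is antisymmetric in sign: pre-before-post potentiates,
    # post-before-pre depresses.
    for dt in (-40.0, -10.0, 10.0, 40.0):
        print(f"dt = {dt:+6.1f} ms -> dw = {stdp_weight_change(dt):+.5f}")

    # TD(0) on a 3-state chain with reward only at the end; the values of
    # earlier (reward-predicting) states ramp up over repeated episodes,
    # loosely analogous to the increasing-slope equilibrium of spike
    # probability described in the abstract.
    V = np.zeros(4)  # states 0..2 plus terminal state 3 (V fixed at 0)
    for _ in range(200):
        for s in range(3):
            r = 1.0 if s == 2 else 0.0
            td0_update(V, s, s + 1, r)
    print("state values:", np.round(V[:3], 3))
```

Running the chain converges the values toward V(2) = 1, V(1) ≈ 0.95, V(0) ≈ 0.90, a monotone ramp toward the reward, which is the TD-side analogue of the stable equilibrium whose slope encodes reward probability.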
