Vladimirskiy Boris B, Vasilaki Eleni, Urbanczik Robert, Senn Walter
Department of Physiology, University of Bern, Switzerland.
Biol Cybern. 2009 Apr;100(4):319-30. doi: 10.1007/s00422-009-0305-x. Epub 2009 Apr 10.
Reinforcement learning in neural networks requires a mechanism for exploring new network states in response to a single, nonspecific reward signal. Existing models have introduced synaptic or neuronal noise to drive this exploration. However, those types of noise tend to almost average out, precluding or significantly hindering learning, when coding in neuronal populations or by mean firing rates is considered. Furthermore, careful tuning is required to find the elusive balance between the often conflicting demands of speed and reliability of learning. Here we show that there is in fact no need to rely on intrinsic noise. Instead, ongoing synaptic plasticity, triggered by the naturally occurring online sampling of a stimulus out of an entire stimulus set, produces enough fluctuation in the synaptic efficacies for successful learning. By combining stimulus sampling with reward attenuation, we demonstrate that a simple Hebbian-like learning rule yields performance very close to that of primates on visuomotor association tasks. In contrast, learning rules based on intrinsic noise (node and weight perturbation) are markedly slower. Furthermore, the performance advantage of our approach persists for more complex tasks and network architectures. We suggest that stimulus sampling and reward attenuation are two key components of a framework by which any single-cell supervised learning rule can be converted into a reinforcement learning rule for networks, without requiring any intrinsic noise source.
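To make the mechanism described in the abstract concrete, the following is a minimal sketch of a reward-attenuated, Hebbian-like update driven purely by online stimulus sampling, with no injected synaptic or neuronal noise. It is not the authors' exact formulation: the single-layer winner-take-all readout, binary reward, and the per-stimulus running-average reward trace used as the attenuation baseline are illustrative assumptions.

```python
# Sketch only (assumptions, not the paper's exact model): a visuomotor
# association task in which each stimulus must be mapped to one action.
# Exploration comes solely from sampling a different stimulus on each trial;
# reward attenuation subtracts a per-stimulus running average of past reward.
import numpy as np

rng = np.random.default_rng(0)

n_stimuli, n_inputs, n_actions = 4, 50, 2
eta, trace_rate = 0.05, 0.1              # learning rate, reward-trace update rate

# Random binary stimulus patterns and an arbitrary stimulus -> action mapping.
stimuli = rng.integers(0, 2, size=(n_stimuli, n_inputs)).astype(float)
targets = rng.integers(0, n_actions, size=n_stimuli)

W = 0.01 * rng.standard_normal((n_actions, n_inputs))   # readout weights
reward_trace = np.zeros(n_stimuli)                       # attenuation baseline per stimulus

for trial in range(5000):
    s = rng.integers(n_stimuli)          # online sampling of one stimulus from the set
    x = stimuli[s]
    y = W @ x                            # postsynaptic activities
    action = int(np.argmax(y))           # winner-take-all action selection
    reward = 1.0 if action == targets[s] else 0.0

    # Reward attenuation: only the part of the reward not yet "expected"
    # for this stimulus drives plasticity.
    attenuated = reward - reward_trace[s]
    reward_trace[s] += trace_rate * attenuated

    # Hebbian-like update on the synapses of the chosen action unit.
    post = np.zeros(n_actions)
    post[action] = 1.0
    W += eta * attenuated * np.outer(post, x)

accuracy = np.mean([np.argmax(W @ stimuli[s]) == targets[s] for s in range(n_stimuli)])
print(f"final accuracy on the {n_stimuli}-stimulus association task: {accuracy:.2f}")
```

In this sketch the weight fluctuations needed for exploration arise only from the trial-to-trial variation of the sampled stimulus, while the attenuated reward gradually silences plasticity once an association is reliably rewarded.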