Wiebke Potjans, Abigail Morrison, Markus Diesmann
Computational Neuroscience Group, RIKEN Brain Science Institute, Wako City, Saitama 351-0198, Japan.
Neural Comput. 2009 Feb;21(2):301-39. doi: 10.1162/neco.2008.08-07-593.
The ability to adapt behavior to maximize reward as a result of interactions with the environment is crucial for the survival of any higher organism. In the framework of reinforcement learning, temporal-difference learning algorithms provide an effective strategy for such goal-directed adaptation, but it is unclear to what extent these algorithms are compatible with neural computation. In this article, we present a spiking neural network model that implements actor-critic temporal-difference learning by combining local plasticity rules with a global reward signal. The network is capable of solving a nontrivial gridworld task with sparse rewards. We derive a quantitative mapping of plasticity parameters and synaptic weights to the corresponding variables in the standard algorithmic formulation and demonstrate that the network learns at a speed similar to that of its discrete-time counterpart and attains the same equilibrium performance.
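For reference, the following is a minimal sketch of the discrete-time, tabular actor-critic TD(0) algorithm that serves as the algorithmic counterpart of the spiking network described in the abstract. The grid size, learning rates, discount factor, and reward placement are illustrative assumptions, not parameters taken from the paper; the sketch only shows the structure of the standard formulation, in which a single TD error drives both the critic (state values) and the actor (action preferences), analogous to the global signal shared by the local plasticity rules.

```python
import numpy as np

# Tabular actor-critic TD(0) on a gridworld with one sparse reward.
# All constants below are illustrative choices, not values from the paper.

rng = np.random.default_rng(0)

N = 5                                          # gridworld is N x N (assumed size)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL = (N - 1, N - 1)                          # rewarded state (hypothetical placement)
GAMMA = 0.9                                    # discount factor (assumed)
ALPHA_V = 0.1                                  # critic learning rate (assumed)
ALPHA_P = 0.1                                  # actor learning rate (assumed)

V = np.zeros((N, N))                           # critic: state-value table
prefs = np.zeros((N, N, len(ACTIONS)))         # actor: action-preference table

def step(state, a):
    """Apply action a; walls keep the agent inside the grid."""
    r, c = state
    dr, dc = ACTIONS[a]
    nr = min(max(r + dr, 0), N - 1)
    nc = min(max(c + dc, 0), N - 1)
    reward = 1.0 if (nr, nc) == GOAL else 0.0  # sparse reward: goal only
    return (nr, nc), reward

def policy(state):
    """Sample an action from a softmax over the actor's preferences."""
    p = np.exp(prefs[state] - prefs[state].max())
    p /= p.sum()
    return rng.choice(len(ACTIONS), p=p)

for episode in range(500):
    state = (0, 0)
    while state != GOAL:
        a = policy(state)
        next_state, reward = step(state, a)
        # TD error: the shared learning signal for both critic and actor
        delta = reward + GAMMA * V[next_state] - V[state]
        V[state] += ALPHA_V * delta            # critic update
        prefs[state][a] += ALPHA_P * delta     # actor update
        state = next_state
```

After training, the critic's value table forms a gradient toward the rewarded state and the actor's preferences favor actions that ascend it; the paper's contribution is showing how plasticity parameters and synaptic weights in a spiking network map quantitatively onto the variables of such a discrete-time formulation.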