IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):6178-6190. doi: 10.1109/TNNLS.2018.2826721. Epub 2018 May 8.
Reinforcement learning (RL) has recently regained popularity thanks to major achievements such as defeating the European Go champion. Here, for the first time, we show that RL can be used efficiently to train a spiking neural network (SNN) to perform object recognition in natural images without an external classifier. We used a feedforward convolutional SNN and a temporal coding scheme in which the most strongly activated neurons fire first, while less activated ones fire later or not at all. In the highest layers, each neuron was assigned to an object category, and the stimulus category was assumed to be that of the first neuron to fire. If this assumption was correct, the neuron was rewarded, i.e., spike-timing-dependent plasticity (STDP) was applied, which reinforced the neuron's selectivity. Otherwise, anti-STDP was applied, which encouraged the neuron to learn something else. As demonstrated on several image data sets (Caltech, ETH-80, and NORB), this reward-modulated STDP (R-STDP) approach extracted particularly discriminative visual features, whereas classic unsupervised STDP extracts any feature that consistently repeats. As a result, R-STDP outperformed STDP on these data sets. Furthermore, R-STDP is suitable for online learning and can adapt to drastic changes such as label permutations. Finally, both feature extraction and classification were done with spikes, using at most one spike per neuron. Thus, the network is hardware friendly and energy efficient.
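The decision and learning loop described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the learning rates `a_plus` and `a_minus`, the linear intensity-to-latency mapping, and the soft-bound term `w * (1 - w)` are all assumptions chosen for illustration, since the abstract does not give the exact update rule. The sketch shows three steps: stronger activations are turned into earlier spikes, the first output neuron to fire determines the predicted category, and the weight change follows STDP when the prediction is correct and its sign-flipped counterpart (anti-STDP) when it is not.

```python
def intensity_to_latency(activations, t_max=1.0, threshold=0.0):
    """Temporal coding: stronger activation -> earlier spike.

    Returns one spike time per neuron, or None if the neuron never fires.
    The linear mapping is an illustrative assumption.
    """
    peak = max(activations)
    latencies = []
    for a in activations:
        if peak <= 0 or a <= threshold:
            latencies.append(None)  # too weak: no spike at all
        else:
            latencies.append(t_max * (1.0 - a / peak))
    return latencies


def first_to_fire(spike_times):
    """Index of the neuron with the earliest spike, or None if none fired."""
    fired = [(t, i) for i, t in enumerate(spike_times) if t is not None]
    return min(fired)[1] if fired else None


def r_stdp_update(weights, pre_times, post_time, rewarded,
                  a_plus=0.01, a_minus=0.005):
    """Reward-modulated STDP for the synapses of one output neuron.

    Rewarded: plain STDP (strengthen inputs that fired at or before the
    output spike, weaken the rest). Not rewarded: anti-STDP (signs flipped).
    a_plus, a_minus and the w * (1 - w) soft bound are assumed values.
    """
    sign = 1.0 if rewarded else -1.0
    updated = []
    for w, t_pre in zip(weights, pre_times):
        causal = t_pre is not None and t_pre <= post_time
        dw = a_plus if causal else -a_minus
        updated.append(w + sign * dw * w * (1.0 - w))
    return updated


# One learning step for a hypothetical two-category problem.
output_spikes = intensity_to_latency([0.2, 1.0])  # neuron 1 is most active
winner = first_to_fire(output_spikes)             # predicted category index
true_label = 1
weights = r_stdp_update([0.5, 0.5], pre_times=[0.1, 0.9], post_time=0.5,
                        rewarded=(winner == true_label))
```

Because each neuron contributes at most one spike time, the whole decision reduces to finding the earliest spike, which is the property the abstract points to when it calls the network hardware friendly.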