NRC "Kurchatov Institute", Akademika Kurchatova sq., 1 Moscow, Russian Federation.
NRC "Kurchatov Institute", Akademika Kurchatova sq., 1 Moscow, Russian Federation; Russian Technological University "MIREA", Vernadsky av., 78 Moscow, Russian Federation.
Neural Netw. 2023 Sep;166:512-523. doi: 10.1016/j.neunet.2023.07.031. Epub 2023 Jul 31.
Neural networks implemented in memristor-based hardware can provide fast and efficient in-memory computation, but traditional learning methods such as error back-propagation are hardly feasible in such hardware. Spiking neural networks (SNNs) are highly promising in this regard, as their weights can be changed locally in a self-organized manner, without the high-precision updates that must otherwise be computed from information spanning almost the entire network. Local learning is particularly relevant for control tasks solved with neural-network reinforcement learning, as such methods are highly sensitive to any source of stochasticity in model initialization, training, or decision making. This paper presents an online reinforcement learning algorithm in which connection weights are updated after each environment state is processed during interaction-with-environment data generation. Another novel feature of the algorithm is that it is applied to SNNs with memristor-based STDP-like learning rules. The plasticity functions are obtained from real memristors based on poly-p-xylylene and a CoFeB-LiNbO nanocomposite, which were experimentally assembled and characterized. The SNN consists of leaky integrate-and-fire neurons. Environmental states are encoded by the timings of input spikes, and the control action is decoded from the first output spike. The proposed learning algorithm successfully solves the Cart-Pole benchmark task. This result could be a first step towards a real-time agent learning procedure in a continuous-time environment, running on neuromorphic systems with memristive synapses.
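To make the described pipeline concrete, the following is a minimal sketch of the three ingredients the abstract names: latency (time-to-first-spike) encoding of the state, leaky integrate-and-fire output neurons whose first spike selects the action, and a reward-modulated STDP-like local weight update. All function names, parameters, and the specific plasticity kernel are illustrative assumptions, not the paper's actual model or its measured memristor plasticity functions.

```python
# Hedged sketch: latency coding + LIF neurons + reward-modulated STDP-like
# update. Names and parameter values are illustrative, not from the paper.
import numpy as np

def encode_latency(state, t_max=20):
    """Map each state component in [0, 1] to an input spike time:
    larger values spike earlier (time-to-first-spike coding)."""
    return np.round((1.0 - np.clip(state, 0.0, 1.0)) * t_max).astype(int)

def first_spike_action(state, weights, tau=10.0, v_th=1.0, t_max=20):
    """Simulate LIF output neurons in discrete time; the action is the
    index of the neuron that fires first.
    Returns (action, output spike time, input spike times)."""
    in_times = encode_latency(state, t_max)
    v = np.zeros(weights.shape[1])             # membrane potentials
    for t in range(t_max + 1):
        v *= np.exp(-1.0 / tau)                # leak
        v += weights[in_times == t].sum(axis=0)  # integrate arriving spikes
        fired = np.flatnonzero(v >= v_th)
        if fired.size:                         # first spike decides the action
            return int(fired[np.argmax(v[fired])]), t, in_times
    return int(np.argmax(v)), t_max, in_times  # fallback: most depolarized

def stdp_update(weights, action, t_out, in_times, reward, lr=0.05, tau_p=5.0):
    """STDP-like local rule: synapses whose input spike preceded the output
    spike are changed by an amount decaying with the spike-time difference,
    gated by the reward (reward-modulated STDP)."""
    dt = t_out - in_times                      # pre-before-post => dt >= 0
    dw = np.where(dt >= 0, np.exp(-dt / tau_p), 0.0)
    weights[:, action] += lr * reward * dw     # purely local update
    return weights
```

In an actual memristive implementation, the exponential kernel in `stdp_update` would be replaced by the plasticity function measured from the device, and the update would occur online after every environment step, as the abstract describes.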