Putra Rachmad Vidya Wicaksana, Hanif Muhammad Abdullah, Shafique Muhammad
Embedded Computing Systems, Institute of Computer Engineering, Technische Universität Wien (TU Wien), Vienna, Austria.
eBrain Lab, Division of Engineering, New York University Abu Dhabi (NYUAD), Abu Dhabi, United Arab Emirates.
Front Neurosci. 2023 Apr 12;17:1159440. doi: 10.3389/fnins.2023.1159440. eCollection 2023.
To maximize the performance and energy efficiency of Spiking Neural Network (SNN) processing on resource-constrained embedded systems, specialized hardware accelerators/chips are employed. However, these SNN chips may suffer from permanent faults that affect the functionality of the weight memory and the behavior of neurons, thereby causing potentially significant accuracy degradation and system malfunction. Such permanent faults may come from manufacturing defects during fabrication and/or from device/transistor damage (e.g., due to wear-out) during run-time operation. However, the impact of permanent faults in SNN chips and the respective mitigation techniques have not yet been thoroughly investigated. Toward this, we propose RescueSNN, a novel methodology to mitigate permanent faults in the compute engine of SNN chips without retraining, thereby significantly cutting down the design time and retraining costs, while maintaining throughput and quality. The key ideas of our RescueSNN methodology are (1) analyzing the characteristics of SNNs under permanent faults; (2) leveraging this analysis to improve SNN fault tolerance through effective fault-aware mapping (FAM); and (3) devising lightweight hardware enhancements to support FAM. Our FAM technique leverages the fault map of the SNN compute engine for (i) minimizing weight corruption when mapping weight bits onto faulty memory cells, and (ii) selectively employing faulty neurons that do not cause significant accuracy degradation, so as to maintain accuracy and throughput while considering the SNN operations and processing dataflow. The experimental results show that RescueSNN improves accuracy by up to 80% while keeping the throughput reduction below 25% at a high fault rate (e.g., 0.5 of the potential fault locations), as compared to running SNNs on the faulty chip without mitigation.
In this manner, the embedded systems that employ RescueSNN-enhanced chips can efficiently ensure reliable executions against permanent faults during their operational lifetime.
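To make idea (i) of FAM more concrete, the following is a minimal sketch of one possible way to reduce weight corruption when a fault map is known. It is not the RescueSNN implementation: the 8-bit weight width, the stuck-at fault model, the per-word bit rotation, and all names (`apply_stuck_at`, `best_rotation`, `fault_mask`, `stuck_values`) are assumptions made for illustration only. The sketch rotates the bits of a weight before storing it, so that the faulty cells hold bits whose stuck value happens to match or matters least.

```python
WORD_BITS = 8            # assumed weight precision (hypothetical)
WORD_MASK = (1 << WORD_BITS) - 1


def apply_stuck_at(word: int, fault_mask: int, stuck_values: int) -> int:
    """Value read back from a memory word whose faulty cells (bits set in
    fault_mask) are stuck at the corresponding bits of stuck_values."""
    return (word & ~fault_mask & WORD_MASK) | (stuck_values & fault_mask)


def rotate_left(value: int, shift: int) -> int:
    """Circularly rotate a WORD_BITS-wide value to the left."""
    shift %= WORD_BITS
    return ((value << shift) | (value >> (WORD_BITS - shift))) & WORD_MASK


def rotate_right(value: int, shift: int) -> int:
    """Circularly rotate a WORD_BITS-wide value to the right."""
    return rotate_left(value, WORD_BITS - (shift % WORD_BITS))


def best_rotation(weight: int, fault_mask: int, stuck_values: int):
    """Try every rotation and return (shift, read_back_weight) for the one
    with the smallest absolute error; ties keep the smallest shift."""
    best_shift, best_read = 0, None
    for shift in range(WORD_BITS):
        stored = rotate_left(weight, shift)             # what is written
        faulty = apply_stuck_at(stored, fault_mask, stuck_values)
        read = rotate_right(faulty, shift)              # what is recovered
        if best_read is None or abs(read - weight) < abs(best_read - weight):
            best_shift, best_read = shift, read
    return best_shift, best_read


if __name__ == "__main__":
    weight = 0b10110010                  # 178
    msb_stuck_at_0 = 0b10000000          # hypothetical fault-map entry
    print(apply_stuck_at(weight, msb_stuck_at_0, 0))   # unmitigated read-back
    print(best_rotation(weight, msb_stuck_at_0, 0))    # chosen shift, read-back
```

In this sketch the chosen shift has to be stored per word (or per group of words) so that the read path can undo the rotation, which is the kind of bookkeeping a lightweight hardware enhancement would need to provide.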