School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.
Sensors (Basel). 2022 Nov 6;22(21):8545. doi: 10.3390/s22218545.
Deep neural networks have been deployed on various hardware accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC) chips. Inference typically requires a huge amount of computation, creating significant logic resource overhead. In addition, frequent data accesses between off-chip memory and the hardware accelerator create bottlenecks that degrade hardware efficiency. Many solutions have been proposed to reduce hardware overhead and data movement. For example, lookup-table (LUT)-based hardware architectures can be used to reduce the demand for arithmetic operations. However, typical LUT-based accelerators suffer from limited computational precision and poor scalability. In this paper, we propose a search-based computing scheme built on an LUT solution, which improves computational efficiency by replacing traditional multiplication with a search operation. In addition, the proposed scheme supports multiple bit widths, meeting the precision requirements of different DNN-based applications. We design a reconfigurable computing strategy that efficiently adapts to convolutions with different kernel sizes, improving hardware scalability. We implement a search-based architecture, namely SCA, which adopts an on-chip storage mechanism, greatly reducing interactions with off-chip memory and alleviating bandwidth pressure. In our experimental evaluation, the proposed SCA architecture achieves 92%, 96%, and 98% computational utilization at 4-bit, 8-bit, and 16-bit precision, respectively. Compared with a state-of-the-art LUT-based architecture, efficiency improves four-fold.
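To make the core idea concrete, the following is a minimal sketch of LUT-based multiplication, not the paper's actual SCA design: the products of each weight with every possible activation code are precomputed, so inference replaces each multiply with a table read (a "search"). The function names (`build_lut`, `lut_dot`) and the 8-bit unsigned activation encoding are illustrative assumptions.

```python
# Hypothetical sketch of LUT-based "search" multiplication: each weight gets a
# precomputed table of products, so a multiply becomes an indexed table read.
import numpy as np

def build_lut(weight: int, act_bits: int = 8) -> np.ndarray:
    """Precompute weight * a for every possible act_bits-wide activation code."""
    acts = np.arange(2 ** act_bits, dtype=np.int32)
    return weight * acts  # one table entry per activation value

def lut_dot(weights, activations, act_bits: int = 8) -> int:
    """Dot product in which every multiplication is a table lookup."""
    luts = [build_lut(w, act_bits) for w in weights]  # one table per weight
    return int(sum(lut[a] for lut, a in zip(luts, activations)))

# Example: 3-tap dot product with 8-bit activations
w = [3, -2, 5]
a = [10, 200, 17]
assert lut_dot(w, a) == 3 * 10 - 2 * 200 + 5 * 17
print(lut_dot(w, a))  # -285
```

Note the trade-off this sketch exposes: each weight needs a table of 2^n entries for n-bit activations, exchanging memory for multipliers. This is exactly why precision and scalability are the usual pain points of LUT-based accelerators that the abstract mentions.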