Fan Hongxiang, Liu Shuanglong, Que Zhiqiang, Niu Xinyu, Luk Wayne
IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):4473-4487. doi: 10.1109/TNNLS.2021.3116302. Epub 2023 Aug 4.
Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown an excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead on the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them focus only on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using a zero point, static BFP quantization decreases the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to reduce the precision to an 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization. Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8-5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4-2.2 times higher resource efficiency on both 2-D and 3-D CNNs.
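The abstract does not spell out how a single datapath serves both 2-D and 3-D convolutions. One standard observation behind unified designs of this kind (stated here as general background, not as this paper's specific mapping) is that a 3-D convolution decomposes into an accumulation of 2-D convolutions along the depth axis, so a 2-D compute engine can be time-multiplexed over kernel slices to evaluate 3-D layers. A minimal NumPy/SciPy check of this identity, with the illustrative function name conv3d_via_2d:

```python
import numpy as np
from scipy.signal import correlate

def conv3d_via_2d(x, w):
    """Compute a 'valid' 3-D cross-correlation by accumulating 2-D
    cross-correlations over the depth axis: each of the D-Kd+1 output
    slices is the sum of Kd independent 2-D passes."""
    Kd = w.shape[0]
    out_d = x.shape[0] - Kd + 1
    out = np.zeros((out_d,
                    x.shape[1] - w.shape[1] + 1,
                    x.shape[2] - w.shape[2] + 1))
    for od in range(out_d):
        for kd in range(Kd):
            # One 2-D pass per (output slice, kernel slice) pair.
            out[od] += correlate(x[od + kd], w[kd], mode="valid")
    return out

x = np.random.rand(8, 16, 16)   # depth x height x width input
w = np.random.rand(3, 3, 3)     # 3-D kernel
ref = correlate(x, w, mode="valid")  # direct 3-D correlation
assert np.allclose(conv3d_via_2d(x, w), ref)
```

The decomposition preserves every multiply-accumulate, which is consistent with the abstract's report of high MAC efficiency on both layer types from one architecture.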
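To make the quantization claim concrete, here is a minimal NumPy sketch of block floating point, not the authors' implementation; the function name bfp_quantize and its parameters are illustrative. In BFP, a block of values shares one power-of-two exponent while each value keeps a short integer mantissa (8 bits in the paper). Passing a precomputed exponent models the static variant; deriving it from the data models the dynamic variant.

```python
import numpy as np

def bfp_quantize(x, mantissa_bits=8, block_size=16, shared_exp=None):
    """Quantize a flat array to block floating point (BFP): each block of
    `block_size` values shares one power-of-two exponent, and each value
    keeps a signed `mantissa_bits`-bit integer mantissa.

    shared_exp=None  -> dynamic BFP: exponents derived from the data.
    shared_exp given -> static BFP: exponents fixed offline (scalar or
                        one value per block, shape (n_blocks, 1)).
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    if shared_exp is None:
        # Dynamic BFP: pick each block's exponent so its largest
        # magnitude fits the signed mantissa range (edge cases clip).
        max_mag = np.abs(blocks).max(axis=1, keepdims=True)
        shared_exp = np.ceil(np.log2(np.where(max_mag > 0, max_mag, 1.0)))

    # Power-of-two scale; mantissas become plain integers in [lo, hi].
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissa = np.clip(np.round(blocks / scale), lo, hi).astype(np.int32)

    dequantized = (mantissa * scale).reshape(-1)[: x.size]
    return mantissa, shared_exp, dequantized
```

Within a block, multiply-accumulate then operates on plain integer mantissas, with exponents applied only when combining blocks. Fixing the exponents offline, as in the static scheme, removes the run-time exponent re-derivation and mantissa realignment of dynamic BFP, which is the representation-conversion overhead the abstract says is eliminated.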