IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):3974-3987. doi: 10.1109/TNNLS.2021.3055240. Epub 2022 Aug 3.
Due to the huge success and rapid development of convolutional neural networks (CNNs), there is a growing demand for hardware accelerators that accommodate a variety of CNNs to improve their inference latency and energy efficiency, enabling their deployment in real-time applications. Among popular platforms, field-programmable gate arrays (FPGAs) have been widely adopted for CNN acceleration because of their capability to provide superior energy efficiency and low-latency processing while supporting high reconfigurability, making them favorable for accelerating rapidly evolving CNN algorithms. This article introduces a highly customized streaming hardware architecture that improves compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs. The proposed accelerator maps the main computational functions, that is, convolutional and deconvolutional layers, onto a single unified module, and implements the residual and concatenative connections between these functions with high efficiency, to support the inference of mainstream CNNs with different topologies. The architecture is further optimized by exploiting multiple levels of parallelism, layer fusion, and full utilization of digital signal processing blocks (DSPs). The proposed accelerator has been implemented on Intel's Arria 10 GX1150 hardware and evaluated with a wide range of benchmark models. The results demonstrate a throughput of over 1.3 TOP/s and up to 97% compute [multiply-accumulate (MAC)] efficiency, outperforming state-of-the-art FPGA accelerators.
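To make the headline numbers concrete, the quoted MAC efficiency is the ratio of measured throughput to the device's theoretical peak. The sketch below illustrates this calculation; the 1518 DSP blocks figure matches the Arria 10 GX1150 datasheet, but the 2 MACs per DSP (fixed-point mode) and the 220 MHz clock are illustrative assumptions, not figures taken from the abstract.

```python
def mac_efficiency(measured_tops: float, num_dsps: int,
                   macs_per_dsp: int, freq_hz: float) -> float:
    """Ratio of measured throughput to the theoretical peak.

    Each MAC counts as 2 ops (one multiply plus one accumulate),
    so peak ops/s = DSPs * MACs-per-DSP * 2 * clock frequency.
    """
    peak_tops = num_dsps * macs_per_dsp * 2 * freq_hz / 1e12
    return measured_tops / peak_tops

# Hypothetical operating point: 1518 DSP blocks (Arria 10 GX1150),
# assumed 2 MACs per DSP in fixed-point mode, assumed 220 MHz clock.
eff = mac_efficiency(measured_tops=1.3, num_dsps=1518,
                     macs_per_dsp=2, freq_hz=220e6)
print(f"{eff:.0%}")  # ~97% under these assumed parameters
```

Under these assumed parameters the peak is about 1.34 TOP/s, so a measured 1.3 TOP/s corresponds to roughly 97% MAC efficiency, consistent with the figures reported above.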