Huang Wenjin, Wu Huangtao, Chen Qingkun, Luo Conghui, Zeng Shihao, Li Tianrui, Huang Yihua
IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):4069-4083. doi: 10.1109/TNNLS.2021.3055814. Epub 2022 Aug 3.
Field-programmable gate array (FPGA)-based CNN hardware accelerators adopting either a single-computing-engine (CE) architecture or a multi-CE architecture have attracted great attention in recent years. The actual throughput of such accelerators has been rising steadily but still falls far below the theoretical throughput, owing to inefficient computing-resource mapping mechanisms, data-supply problems, and so on. To solve these problems, a novel composite hardware CNN accelerator architecture is proposed in this article. To perform the convolution layers (CLs) efficiently, a novel multi-CE architecture based on a row-level pipelined streaming strategy is proposed. For each CE, an optimized mapping mechanism is proposed to improve its computing-resource utilization ratio, and an efficient data system with continuous data supply is designed to avoid idle states of the CE. In addition, to relieve off-chip bandwidth stress, a weight-data allocation strategy is proposed. To perform the fully connected layers (FCLs), a single-CE architecture based on a batch-based computing method is proposed. Based on these design methods and strategies, both visual geometry group network-16 (VGG-16) and ResNet-101 are implemented on the XC7VX980T FPGA platform. The VGG-16 accelerator consumes 3395 multipliers and achieves a throughput of 1 TOPS at 150 MHz, about 98.15% of the theoretical throughput (2 × 3395 × 150 MOPS). Similarly, the ResNet-101 accelerator achieves 600 GOPS at 100 MHz, about 96.12% of the theoretical throughput (2 × 3121 × 100 MOPS).
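The theoretical-throughput figures quoted above follow a standard accounting: each multiplier performs one multiply-accumulate (i.e., 2 operations) per clock cycle, so peak throughput is 2 × multipliers × frequency. A minimal sketch of that arithmetic (the function name is illustrative, not from the paper):

```python
def theoretical_gops(multipliers: int, freq_mhz: float) -> float:
    """Peak throughput in GOPS, assuming one MAC (2 ops) per multiplier per cycle."""
    return 2 * multipliers * freq_mhz / 1000.0

# VGG-16 accelerator: 3395 multipliers at 150 MHz
vgg_peak = theoretical_gops(3395, 150)        # 1018.5 GOPS theoretical peak

# ResNet-101 accelerator: 3121 multipliers at 100 MHz
resnet_peak = theoretical_gops(3121, 100)     # 624.2 GOPS theoretical peak
resnet_eff = 600 / resnet_peak                # measured 600 GOPS -> ~96.12% efficiency
```

This reproduces the utilization ratios reported in the abstract, e.g., 600 GOPS against a 624.2 GOPS peak gives roughly 96.12% for the ResNet-101 design.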