Zhao Yunping, Lu Jianzhuang, Chen Xiaowen
College of Computer, National University of Defense Technology, Changsha 410073, China.
Sensors (Basel). 2020 Sep 28;20(19):5558. doi: 10.3390/s20195558.
Because convolutional neural networks (CNNs) demand high throughput and computing capability, researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize the convolution expansion and resolve the blocking problem of the intermediate matrix, enabling highly parallel implementation in hardware. Moreover, we provide a specific calculation method for the optimal partitioning of the matrix multiplication to optimize performance. In our evaluation, the proposed method saves more than 60% of hardware storage space compared with the im2col (image-to-column) approach; for large-scale convolutions, it saves nearly 82%. Under the accelerator architecture framework designed in this paper, we achieve 26.7-33.4 GFLOPS (depending on the convolution type) on an FPGA (Field-Programmable Gate Array) by reducing bandwidth requirements and improving data reusability. The design is 1.2×-4.0× faster than memory-efficient convolution (MEC) and im2col, respectively, and represents an effective solution for a large-scale convolution accelerator.
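To make the comparison concrete, below is a minimal NumPy sketch of the im2col baseline the abstract measures against, not the authors' MTCA; the function names and the 5x5 input / 3x3 kernel sizes are illustrative assumptions. It shows how im2col lowers convolution to a single matrix product, and why the intermediate column matrix replicates input pixels, which is the storage redundancy the proposed method reduces.

    import numpy as np

    def im2col(x, kh, kw, stride=1):
        """Lower a single-channel input into the column matrix used by
        matrix-based convolution (the im2col baseline cited above)."""
        h, w = x.shape
        oh = (h - kh) // stride + 1
        ow = (w - kw) // stride + 1
        cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
        for i in range(oh):
            for j in range(ow):
                patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
                cols[:, i*ow + j] = patch.ravel()
        return cols, (oh, ow)

    def conv2d_im2col(x, k, stride=1):
        """Convolution as one matrix product: each output pixel is the dot
        product of the flattened kernel with one column of `cols`."""
        cols, (oh, ow) = im2col(x, *k.shape, stride)
        return (k.ravel() @ cols).reshape(oh, ow)

    # Example: 5x5 input, 3x3 kernel -> a 9x9 intermediate matrix in which
    # an interior input pixel is stored up to 9 times; this replication is
    # the redundancy that block-partitioned schemes like MTCA avoid.
    x = np.arange(25, dtype=np.float32).reshape(5, 5)
    k = np.ones((3, 3), dtype=np.float32)
    print(conv2d_im2col(x, k))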