IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5922-5934. doi: 10.1109/TNNLS.2018.2815085. Epub 2018 Apr 9.
The deep convolutional neural network (DCNN) is a class of machine learning algorithms based on feed-forward artificial neural networks and is widely used for image processing applications. Implementing DCNNs for real-world problems demands high computational power and high memory bandwidth, often in a power-constrained environment. A general-purpose CPU cannot exploit the different forms of parallelism offered by these algorithms and hence is slow and energy inefficient for practical use. We propose a field-programmable gate array (FPGA)-based runtime programmable coprocessor to accelerate feed-forward computation of DCNNs. The coprocessor can be programmed for a new network architecture at runtime without resynthesizing the FPGA hardware; hence, it acts as a plug-and-use peripheral for the host computer. Caching is implemented for input features and filter weights using on-chip memory to reduce the external memory bandwidth requirement. Data are prefetched at several stages to avoid stalling of the computational units, and different optimization techniques are used to reuse the fetched data efficiently. The dataflow is dynamically adjusted at runtime for each DCNN layer to achieve consistent computational throughput across a wide range of input feature sizes and filter sizes. The coprocessor is prototyped on a Xilinx Virtex-7 XC7VX485T FPGA-based VC707 board and operates at 150 MHz. Experimental results show that our implementation is more energy efficient than a highly optimized CPU implementation and achieves a consistent computational throughput of more than 140 G operations/s for a wide range of input feature sizes and filter sizes. Off-chip memory transactions decrease due to the use of the on-chip cache.
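The abstract describes caching input features and filter weights on chip and reusing the fetched data across many multiply-accumulate operations. The C sketch below is only a software illustration of that data-reuse idea, not the authors' hardware design: a convolutional layer is computed in output tiles, and for each tile the required input patch and weights are copied once into small local buffers (standing in for on-chip caches) before the inner MAC loops. All tile sizes, buffer names, and layer dimensions are hypothetical.

```c
/*
 * Illustrative sketch of tile-based data reuse for a convolutional layer.
 * Local buffers model on-chip caches; all dimensions are hypothetical and
 * not taken from the paper.
 */
#include <stdio.h>
#include <string.h>

#define IN_CH   4                    /* input feature maps  (hypothetical) */
#define OUT_CH  8                    /* output feature maps (hypothetical) */
#define IN_DIM  16                   /* input width/height  (hypothetical) */
#define K       3                    /* filter size         (hypothetical) */
#define OUT_DIM (IN_DIM - K + 1)
#define TILE    8                    /* output-tile edge, models on-chip buffer size */

static float in [IN_CH ][IN_DIM ][IN_DIM ];
static float wt [OUT_CH][IN_CH ][K][K];
static float out[OUT_CH][OUT_DIM][OUT_DIM];

int main(void)
{
    /* Fill inputs and weights with dummy data. */
    for (int c = 0; c < IN_CH; c++)
        for (int y = 0; y < IN_DIM; y++)
            for (int x = 0; x < IN_DIM; x++)
                in[c][y][x] = (float)(c + y + x) * 0.01f;
    for (int o = 0; o < OUT_CH; o++)
        for (int c = 0; c < IN_CH; c++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    wt[o][c][ky][kx] = 0.1f;

    /* Walk the output in tiles.  For each tile, copy the needed input patch
     * and filter weights into local buffers once, then reuse them for every
     * MAC inside the tile instead of re-reading "external" memory. */
    for (int ty = 0; ty < OUT_DIM; ty += TILE) {
        for (int tx = 0; tx < OUT_DIM; tx += TILE) {
            int th = (ty + TILE <= OUT_DIM) ? TILE : OUT_DIM - ty;
            int tw = (tx + TILE <= OUT_DIM) ? TILE : OUT_DIM - tx;

            /* "On-chip" input cache: the tile plus its (K-1) halo. */
            float in_buf[IN_CH][TILE + K - 1][TILE + K - 1];
            for (int c = 0; c < IN_CH; c++)
                for (int y = 0; y < th + K - 1; y++)
                    for (int x = 0; x < tw + K - 1; x++)
                        in_buf[c][y][x] = in[c][ty + y][tx + x];

            for (int o = 0; o < OUT_CH; o++) {
                /* "On-chip" weight cache for this output channel. */
                float w_buf[IN_CH][K][K];
                memcpy(w_buf, wt[o], sizeof w_buf);

                for (int y = 0; y < th; y++)
                    for (int x = 0; x < tw; x++) {
                        float acc = 0.0f;
                        for (int c = 0; c < IN_CH; c++)
                            for (int ky = 0; ky < K; ky++)
                                for (int kx = 0; kx < K; kx++)
                                    acc += in_buf[c][y + ky][x + kx]
                                         * w_buf[c][ky][kx];
                        out[o][ty + y][tx + x] = acc;
                    }
            }
        }
    }

    printf("out[0][0][0] = %f\n", out[0][0][0]);
    return 0;
}
```

As a rough sanity check on the reported figures: sustaining more than 140 G operations/s at a 150 MHz clock implies on the order of 140e9 / 150e6 ≈ 930 operations per cycle, i.e., several hundred parallel multiply-accumulate units if each MAC is counted as two operations; this back-of-the-envelope estimate is ours, not a figure from the paper.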