Dipartimento di Scienze Ambientali, Informatica e Statistica (DAIS), Università Ca'Foscari di Venezia, Via Torino 155, 30170 Venezia, Italy.
Dipartimento di Management, Università Ca'Foscari di Venezia, Cannaregio 873, 30121 Venezia, Italy.
Sensors (Basel). 2023 May 11;23(10):4667. doi: 10.3390/s23104667.
Over the past few years, several applications have extensively exploited the advantages of deep learning, in particular when using convolutional neural networks (CNNs). The intrinsic flexibility of such models makes them widely adopted in a variety of practical applications, from medical to industrial. In the latter scenario, however, consumer Personal Computer (PC) hardware is not always suitable for the potentially harsh conditions of the working environment and the strict timing constraints that industrial applications typically impose. Therefore, the design of custom FPGA (Field Programmable Gate Array) solutions for network inference is gaining massive attention from researchers and companies alike. In this paper, we propose a family of network architectures composed of three kinds of custom layers working with integer arithmetic at a customizable precision (down to just two bits). Such layers are designed to be effectively trained on classical GPUs (Graphics Processing Units) and then synthesized to FPGA hardware for real-time inference. The idea is to provide a trainable quantization layer acting both as a non-linear activation for the neurons and as a value rescaler to match the desired bit precision. This way, the training not only accounts for the quantization, but is also capable of estimating the optimal scaling coefficients to accommodate both the non-linear nature of the activations and the constraints imposed by the limited precision. In the experimental section, we test the performance of this kind of model both on classical PC hardware and on a case-study implementation of a signal peak detection device running on a real FPGA. We employ TensorFlow Lite for training and comparison, and use Xilinx FPGAs and Vivado for synthesis and implementation. The results show an accuracy of the quantized networks close to that of the floating-point version, without the need for representative data for calibration as in other approaches, and performance that is better than dedicated peak detection algorithms. The FPGA implementation is able to run in real time at a rate of four gigapixels per second with moderate hardware resources, while achieving a sustained efficiency of 0.5 TOPS/W (tera operations per second per watt), in line with custom integrated hardware accelerators.
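Although the abstract includes no code, the core idea of a trainable quantization layer that both activates and rescales values can be illustrated with a short sketch. The snippet below is a hypothetical Keras layer (the class name TrainableQuantizer, the learnable scale weight, and the rounding scheme are illustrative assumptions, not the authors' implementation): it clips activations into a learned range, rounds them to a configurable number of integer levels (down to two bits), and uses a straight-through estimator so the scaling coefficient can be trained on a GPU before the network is mapped to fixed-point FPGA arithmetic.

    # Hypothetical sketch of a trainable quantization layer in the spirit of the
    # paper; names and details are illustrative, not the authors' implementation.
    import tensorflow as tf

    class TrainableQuantizer(tf.keras.layers.Layer):
        def __init__(self, bits=2, **kwargs):
            super().__init__(**kwargs)
            self.bits = bits
            self.levels = 2 ** bits - 1  # number of quantization steps

        def build(self, input_shape):
            # Learnable scale mapping activations into the representable range.
            self.scale = self.add_weight(
                name="scale", shape=(), initializer="ones", trainable=True
            )

        def call(self, inputs):
            # Non-linear activation plus rescaling: clip to [0, scale],
            # then round to the nearest of the 2^bits - 1 quantization steps.
            x = tf.clip_by_value(inputs, 0.0, self.scale)
            step = self.scale / self.levels
            q = tf.round(x / step) * step
            # Straight-through estimator: the forward pass outputs the quantized
            # value, while gradients flow through the clipped input (and scale).
            return x + tf.stop_gradient(q - x)

In such a design, the layer would typically follow each convolutional layer in place of a standard activation, so that the scaling coefficients are learned jointly with the weights during quantization-aware training.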