IEEE Trans Neural Netw Learn Syst. 2022 Jul;33(7):2853-2866. doi: 10.1109/TNNLS.2020.3046452. Epub 2022 Jul 6.
Real-time in situ image analytics imposes stringent latency requirements on intelligent neural network inference operations. While conventional software implementations on graphics processing unit (GPU)-accelerated platforms are flexible and achieve very high inference throughput, they are not suitable for latency-sensitive applications where real-time feedback is needed. Here, we demonstrate that high-performance reconfigurable computing platforms based on field-programmable gate array (FPGA) processing can successfully bridge the gap between low-level hardware processing and high-level intelligent image analytics algorithm deployment within a unified system. The proposed design performs inference on a stream of individual images as they are produced, and its deeply pipelined hardware architecture allows all layers of a quantized convolutional neural network (QCNN) to compute concurrently on partial image inputs. Using label-free classification of human peripheral blood mononuclear cell (PBMC) subtypes as a proof-of-concept illustration, our system achieves an ultralow classification latency of 34.2 μs with over 95% end-to-end accuracy using a QCNN, while the cells are imaged at a throughput exceeding 29,200 cells/s. Our QCNN design is modular and readily adaptable to other QCNNs with different latency and resource requirements.
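The two ideas the abstract leans on — weight quantization and layer computation beginning before the full image arrives — can be illustrated with a minimal software sketch. The code below is a hypothetical Python model, not the authors' FPGA implementation: `quantize` applies a simple uniform symmetric 8-bit scheme (an assumption; the paper's quantization details are not given here), and `stream_conv` mimics an FPGA line buffer by emitting each convolution output row as soon as enough input rows have streamed in, rather than waiting for the complete image.

```python
import numpy as np

def quantize(w, bits=8):
    # Uniform symmetric quantization to signed integers (illustrative scheme,
    # not necessarily the one used in the paper's QCNN).
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale).astype(np.int32), scale

def stream_conv(rows, kernel):
    # Emit each output row as soon as kh input rows have arrived,
    # mimicking an FPGA line buffer: computation overlaps with image readout.
    kh, kw = kernel.shape
    buf = []
    for row in rows:                # rows arrive one at a time (streaming input)
        buf.append(row)
        if len(buf) >= kh:
            window = np.stack(buf[-kh:])       # sliding window of the last kh rows
            out = [int(np.sum(window[:, c:c + kw] * kernel))
                   for c in range(window.shape[1] - kw + 1)]
            yield out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(6, 6))        # toy 6x6 "image"
k_q, scale = quantize(rng.standard_normal((3, 3)))

# The first output row is available after only 3 of 6 input rows have arrived.
out_rows = list(stream_conv(iter(img), k_q))
print(len(out_rows), len(out_rows[0]))
```

Chaining several such generators, one per layer, gives the deeply pipelined behavior the paper describes: every layer consumes its predecessor's rows as they appear, so end-to-end latency is set by pipeline depth rather than by full-frame processing time.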