Yan Zhihong, Zhang Bingqian, Wang Dong
Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China.
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China.
Micromachines (Basel). 2024 Sep 19;15(9):1164. doi: 10.3390/mi15091164.
The You Only Look Once (YOLO) object detection network has been widely adopted across industries owing to its fast inference and robust detection capabilities. The model has proven invaluable in automating production processes such as material processing, machining, and quality inspection. However, as market competition intensifies, there is a constant demand for higher detection speed and accuracy. Current FPGA accelerators based on 8-bit quantization have struggled to meet these increasingly stringent performance requirements. In response, we present a novel 4-bit quantization-based neural network accelerator for the YOLOv5 model, designed to enhance real-time processing capabilities while maintaining high detection accuracy. To achieve effective model compression, we introduce an optimized quantization scheme that reduces the bit-width of the entire YOLO network, including the first layer, to 4 bits, with only a 1.5% degradation in mean Average Precision (mAP). For the hardware implementation, we propose a unified Digital Signal Processor (DSP) packing scheme, coupled with a novel parity adder tree architecture that accommodates the proposed quantization strategies. This approach reduces on-chip DSP utilization by 50%, offering a significant improvement in performance and resource efficiency. Experimental results show that the industrial object detection system based on the proposed FPGA accelerator achieves a throughput of 808.6 GOPS and an efficiency of 0.49 GOPS/DSP for YOLOv5s on the ZCU102 board, which is 29% higher than a commercial FPGA accelerator design (Xilinx's Vitis AI).
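The abstract does not spell out how the DSP packing scheme works. As a rough illustration of the general idea behind packing low-bit multiplications into one wide multiplier, the Python sketch below computes two 4-bit products with a single integer multiplication. It is not the authors' design: the field width `SHIFT`, the function names, and the choice of signed 4-bit weights with unsigned 4-bit activations are all assumptions made for this example, and the parity adder tree is not modeled.

```python
# Illustrative sketch only: models the arithmetic of sharing one wide
# multiplier between two low-bit products. Names and widths are hypothetical.

SHIFT = 8  # assumed field width; one signed 4b x unsigned 4b product fits in 8 bits


def pack_weights(w_hi: int, w_lo: int) -> int:
    """Place two signed 4-bit weights into one wide operand."""
    return (w_hi << SHIFT) + w_lo


def unpack_products(p: int) -> tuple[int, int]:
    """Split a packed product back into its two signed partial products."""
    low = p & ((1 << SHIFT) - 1)      # low field, read as unsigned
    if low >= 1 << (SHIFT - 1):       # reinterpret the field as signed
        low -= 1 << SHIFT
    high = (p - low) >> SHIFT         # removing the low field undoes its borrow
    return high, low


def packed_multiply(w_hi: int, w_lo: int, a: int) -> tuple[int, int]:
    """Compute (w_hi * a, w_lo * a) with a single multiplication."""
    return unpack_products(pack_weights(w_hi, w_lo) * a)


if __name__ == "__main__":
    # Two weights share one activation, as two output channels would.
    print(packed_multiply(-3, 7, 15))
```

With these widths a single product always fits in its field. Accumulating several packed products before unpacking would need extra guard bits between the fields, which this sketch leaves out.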