A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks.

Author Information

Li Zhan, Zhang Zhihan, Hu Jie, Meng Qunkang, Shi Xingyu, Luo Jun, Wang Hao, Huang Qijun, Chang Sheng

Publication Information

IEEE Trans Neural Netw Learn Syst. 2025 May;36(5):7970-7983. doi: 10.1109/TNNLS.2024.3423664. Epub 2025 May 6.

Abstract

The design of convolutional neural network (CNN) hardware accelerators based on a single computing engine (CE) or a multi-CE architecture has received widespread attention in recent years. Although such accelerators offer flexibility in hardware-platform deployment and short development cycles, they remain limited in resource utilization and data throughput. When processing large feature maps, they typically reach only about 10 frames/s, which does not meet the requirements of application scenarios such as autonomous driving and radar detection. To address these problems, this article proposes a pixel-level, fully pipelined hardware accelerator design. With a pixel-by-pixel strategy, the concept of the layer is deemphasized, and the way each pixel of the output feature map (Ofmap) is generated can be optimized. To pipeline the entire computing system, each layer of the neural network is unrolled into dedicated hardware, eliminating inter-layer buffers and maximizing connectivity across the whole network. This approach yields excellent performance. Moreover, because the pixel data stream is a fundamental paradigm in image processing, the fully pipelined accelerator generalizes to various CNNs used in computer vision (MobileNetV1, MobileNetV2, and FashionNet). As an example, the accelerator for MobileNetV1 achieves 4205.50 frames/s and a throughput of 4787.15 GOP/s at 211 MHz, with an output latency of 0.60 ms per image. This extremely short processing time opens the door to AI applications in high-speed scenarios.
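The reported figures are internally consistent under a simple pixel-per-cycle model. The back-of-envelope check below is an illustration only; the 224×224 input size, the one-input-pixel-per-clock-cycle assumption, and the ~1.14 GOPs MobileNetV1 workload are assumptions not stated in the abstract itself.

```python
# Back-of-envelope check of the reported MobileNetV1 numbers.
# Assumptions (not stated in the abstract): 224x224 input image,
# one input pixel consumed per clock cycle, and roughly 1.14 GOPs
# (about 569 M multiply-accumulates, counted as 2 ops each) per inference.

clock_hz = 211e6              # reported operating frequency
pixels_per_image = 224 * 224  # assumed input resolution
gops_per_image = 1.14         # approximate MobileNetV1 workload in GOPs

# If the pipeline consumes one pixel per cycle, the frame rate is
# simply the clock frequency divided by the pixels per image.
frames_per_s = clock_hz / pixels_per_image
throughput_gops = frames_per_s * gops_per_image

print(f"frames/s ~= {frames_per_s:,.1f}")    # ~4205, matching the reported 4205.50
print(f"GOP/s    ~= {throughput_gops:,.1f}")  # roughly matches the reported 4787.15

# Note: the 0.60 ms per-image latency is the time for one image to traverse
# the whole pipeline, which exceeds the 1/frame-rate interval (~0.24 ms)
# because many images are in flight in the pipeline at once.
```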

