

A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks.

Authors

Li Zhan, Zhang Zhihan, Hu Jie, Meng Qunkang, Shi Xingyu, Luo Jun, Wang Hao, Huang Qijun, Chang Sheng

Publication

IEEE Trans Neural Netw Learn Syst. 2025 May;36(5):7970-7983. doi: 10.1109/TNNLS.2024.3423664. Epub 2025 May 6.

DOI: 10.1109/TNNLS.2024.3423664
PMID: 38995709
Abstract

The design of convolutional neural network (CNN) hardware accelerators based on a single computing engine (CE) architecture or a multi-CE architecture has received widespread attention in recent years. Although such accelerators offer advantages in hardware platform deployment flexibility and development cycle, they remain limited in resource utilization and data throughput. When processing large feature maps, they typically reach only 10 frames/s, which falls short of the requirements of application scenarios such as autonomous driving and radar detection. To solve these problems, this article proposes a pixel-based, fully pipelined hardware accelerator design. Through a pixel-by-pixel strategy, the concept of the layer is downplayed, and the generation of each pixel of the output feature map (Ofmap) can be optimized. To pipeline the entire computing system, we expand each layer of the neural network into hardware, eliminating the buffers between layers and maximizing the effect of complete connectivity across the entire network. This approach yields excellent performance. Moreover, because the pixel data stream is a fundamental paradigm in image processing, our fully pipelined hardware accelerator is universal for various CNNs (MobileNetV1, MobileNetV2, and FashionNet) in computer vision. As an example, the accelerator for MobileNetV1 achieves a speed of 4205.50 frames/s and a throughput of 4787.15 GOP/s at 211 MHz, with an output latency of 0.60 ms per image. This extremely short processing time opens the door to AI applications in high-speed scenarios.
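The core idea of the abstract — each layer consumes and emits pixels as they arrive, so no full feature map is ever buffered between layers — can be illustrated with a minimal software analogy. This is not the paper's implementation; the `scale_layer` and `offset_layer` stages below are hypothetical stand-ins for hardware pipeline stages, used only to show how chained per-pixel streams eliminate inter-layer frame buffers.

```python
# Minimal sketch of pixel-stream pipelining (illustrative only):
# each generator is a pipeline "stage" that transforms one pixel at a
# time and forwards it immediately, so a pixel entering stage 1 flows
# through to the output without waiting for the rest of the frame.

def scale_layer(pixels, factor):
    """Hypothetical stage: multiply each incoming pixel by a constant."""
    for p in pixels:
        yield p * factor

def offset_layer(pixels, bias):
    """Hypothetical stage: add a constant to each incoming pixel."""
    for p in pixels:
        yield p + bias

def build_pipeline(source, layers):
    """Chain stage generators so the whole network forms one pipeline
    with no intermediate feature-map buffers."""
    stream = source
    for layer in layers:
        stream = layer(stream)
    return stream

if __name__ == "__main__":
    ifmap = iter(range(4))  # input pixel stream: 0, 1, 2, 3
    out = build_pipeline(
        ifmap,
        [lambda s: scale_layer(s, 2),    # stage 1
         lambda s: offset_layer(s, 1)],  # stage 2
    )
    print(list(out))  # [1, 3, 5, 7]
```

In hardware, each stage would be a physical circuit and every stage operates on a different pixel in the same clock cycle; the generator chain only models the dataflow, not that parallelism.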


Similar Articles

1. A High-Performance Pixel-Level Fully Pipelined Hardware Accelerator for Neural Networks.
   IEEE Trans Neural Netw Learn Syst. 2025 May;36(5):7970-7983. doi: 10.1109/TNNLS.2024.3423664. Epub 2025 May 6.
2. FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio.
   IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):4069-4083. doi: 10.1109/TNNLS.2021.3055814. Epub 2022 Aug 3.
3. An OpenCL-Based FPGA Accelerator for Faster R-CNN.
   Entropy (Basel). 2022 Sep 23;24(10):1346. doi: 10.3390/e24101346.
4. AoCStream: All-on-Chip CNN Accelerator with Stream-Based Line-Buffer Architecture and Accelerator-Aware Pruning.
   Sensors (Basel). 2023 Sep 27;23(19):8104. doi: 10.3390/s23198104.
5. QuantLaneNet: A 640-FPS and 34-GOPS/W FPGA-Based CNN Accelerator for Lane Detection.
   Sensors (Basel). 2023 Jul 25;23(15):6661. doi: 10.3390/s23156661.
6. NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps.
   IEEE Trans Neural Netw Learn Syst. 2019 Mar;30(3):644-656. doi: 10.1109/TNNLS.2018.2852335. Epub 2018 Jul 26.
7. A Cost-Efficient High-Speed VLSI Architecture for Spiking Convolutional Neural Network Inference Using Time-Step Binary Spike Maps.
   Sensors (Basel). 2021 Sep 8;21(18):6006. doi: 10.3390/s21186006.
8. Design of Fully Spectral CNNs for Efficient FPGA-Based Acceleration.
   IEEE Trans Neural Netw Learn Syst. 2024 Jun;35(6):8111-8123. doi: 10.1109/TNNLS.2022.3224779. Epub 2024 Jun 3.
9. Flare: An FPGA-Based Full Precision Low Power CNN Accelerator with Reconfigurable Structure.
   Sensors (Basel). 2024 Mar 31;24(7):2239. doi: 10.3390/s24072239.
10. Towards high-performance deep learning architecture and hardware accelerator design for robust analysis in diffuse correlation spectroscopy.
    Comput Methods Programs Biomed. 2025 Jan;258:108471. doi: 10.1016/j.cmpb.2024.108471. Epub 2024 Oct 28.

Cited By

1. Imaging flow cytometry with a real-time throughput beyond 1,000,000 events per second.
   Light Sci Appl. 2025 Feb 10;14(1):76. doi: 10.1038/s41377-025-01754-9.
2. Nanoscale Titanium Oxide Memristive Structures for Neuromorphic Applications: Atomic Force Anodization Techniques, Modeling, Chemical Composition, and Resistive Switching Properties.
   Nanomaterials (Basel). 2025 Jan 6;15(1):75. doi: 10.3390/nano15010075.