HCG: Streaming DCNN Accelerator With a Hybrid Computational Granularity Scheme on FPGA.

Authors

Huang Wenjin, Luo Conghui, Zhao Baoze, Jiao Han, Huang Yihua

Publication information

IEEE Trans Neural Netw Learn Syst. 2025 Jul 24;PP. doi: 10.1109/TNNLS.2025.3587694.

Abstract

With the growth of field-programmable gate array (FPGA) hardware resources, streaming DCNN accelerators leverage interconvolutional-layer parallelism to enhance throughput. In existing streaming accelerators, convolution nodes typically adopt layer- or column-based tiling methods, where the tiled input feature map (Ifmap) encompasses all input channels. This approach facilitates the comprehensive calculation of the output feature map (Ofmap) and maximizes interlayer parallelism. The computational granularity, defined in this study as the calculated rows or columns of Ofmap based on each tiled Ifmap data, significantly influences on-chip Ifmap storage and off-chip weight bandwidth (BW). The uniform application of computational granularity across all nodes inevitably impacts the memory-BW tradeoff. This article introduces a novel streaming accelerator with a hybrid computational granularity (HCG) scheme. Each node employs an independently optimized computational granularity, enabling a more flexible memory-BW tradeoff and more effective utilization of FPGA resources. However, this hybrid scheme can introduce pipeline bubbles and increase system pipeline complexity and control logic. To address these challenges, this article theoretically analyzes the impact of computational granularity on individual computing nodes and the overall system, aiming to establish a seamless system pipeline without pipeline bubbles and simplify system design. Furthermore, the article develops a hardware overhead model and employs a heuristic algorithm to optimize computational granularity for each computing node, achieving optimal memory-BW tradeoff and higher throughput. Finally, the effectiveness of the proposed design and optimization methodology is validated through the implementation of a 3-TOPS ResNet-18 accelerator on the Alveo U250 development board under BW constraints of 25, 20, and 15 GB/s. 
Additionally, accelerators for 4-TOPS VGG-16, 4-TOPS ResNet-34, 5-TOPS ResNet-50, 3-TOPS MobileNetV1, 4-TOPS ConvNeXt-T, and 4-TOPS ResNeXt-50 are implemented, surpassing the performance of most existing works.
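The memory-BW tradeoff described in the abstract can be illustrated with a toy model. The sketch below is not the paper's hardware overhead model or its heuristic algorithm, neither of which is given in the abstract. It assumes row-based tiling, 1 byte per Ifmap element and per weight, and that a node re-streams its full weight set from off-chip once per tile. The names `ConvLayer`, `ifmap_buffer_bytes`, `weight_traffic_bytes`, and `greedy_granularity` are hypothetical, and the greedy search is a stand-in for whatever heuristic the authors use.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one convolution node (not from the paper)."""
    h_out: int       # Ofmap rows
    w_in: int        # Ifmap width
    c_in: int        # input channels
    c_out: int       # output channels
    k: int           # square kernel size
    stride: int = 1


def ifmap_buffer_bytes(layer: ConvLayer, g: int) -> int:
    """On-chip Ifmap bytes needed to compute g Ofmap rows per tile.

    Assumption: all input channels are buffered, 1 byte per element.
    """
    rows = (g - 1) * layer.stride + layer.k
    return rows * layer.w_in * layer.c_in


def weight_traffic_bytes(layer: ConvLayer, g: int) -> int:
    """Off-chip weight bytes per frame.

    Assumption: the full weight set is re-streamed once per tile.
    """
    tiles = ceil(layer.h_out / g)
    return tiles * layer.k * layer.k * layer.c_in * layer.c_out


def greedy_granularity(layers, mem_budget, traffic_budget):
    """Pick one granularity per node with a simple greedy search.

    Starting from g = 1 everywhere, repeatedly grow the node that saves the
    most weight traffic per extra buffer byte, until the traffic budget is met
    or no node can grow within the memory budget.
    """
    g = [1] * len(layers)

    def total_mem():
        return sum(ifmap_buffer_bytes(l, gi) for l, gi in zip(layers, g))

    def total_traffic():
        return sum(weight_traffic_bytes(l, gi) for l, gi in zip(layers, g))

    while total_traffic() > traffic_budget:
        best = None  # (score, node index)
        for i, layer in enumerate(layers):
            if g[i] >= layer.h_out:
                continue  # already computes the whole Ofmap per tile
            d_mem = ifmap_buffer_bytes(layer, g[i] + 1) - ifmap_buffer_bytes(layer, g[i])
            if total_mem() + d_mem > mem_budget:
                continue  # growing this node would exceed on-chip memory
            d_traffic = weight_traffic_bytes(layer, g[i]) - weight_traffic_bytes(layer, g[i] + 1)
            score = d_traffic / d_mem
            if best is None or score > best[0]:
                best = (score, i)
        if best is None:
            break  # no feasible move left
        g[best[1]] += 1
    return g


if __name__ == "__main__":
    # Two illustrative nodes: an early, wide layer and a late, weight-heavy layer.
    nodes = [ConvLayer(56, 56, 64, 64, 3), ConvLayer(7, 7, 512, 512, 3)]
    print(greedy_granularity(nodes, mem_budget=64 * 1024, traffic_budget=8_000_000))
```

Under these assumptions, a larger granularity enlarges the Ifmap buffer linearly while cutting weight traffic roughly in inverse proportion, which is why a single granularity shared by all nodes can be a poor fit for layers with very different Ifmap and weight sizes.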
