Deep Learning Accelerators' Configuration Space Exploration Effect on Performance and Resource Utilization: A Gemmini Case Study.

Affiliations

Electronics Division, Institute for Scientific and Technological Information, Council for Scientific and Industrial Research, Accra, Ghana.

Intelligent Image Processing Research Center, Korea Electronics Technology Institute, Seongnam-si 13488, Republic of Korea.

Publication information

Sensors (Basel). 2023 Feb 21;23(5):2380. doi: 10.3390/s23052380.

Abstract

Custom deep learning (DL) hardware accelerators are attractive for inference on edge computing devices, but their design and implementation remain challenging. Open-source frameworks exist for exploring DL hardware accelerators; Gemmini is one of them, an open-source systolic array generator for agile DL accelerator exploration. This paper details the hardware/software components generated using Gemmini. General matrix-to-matrix multiplication (GEMM) was run in Gemmini under different dataflow options, including output stationary and weight stationary (OS/WS), to estimate performance relative to a CPU implementation. The Gemmini hardware was implemented on an FPGA device to explore the effect of several accelerator parameters, including array size, memory capacity, and the CPU/hardware image-to-column (im2col) module, on metrics such as area, frequency, and power. In terms of performance, the WS dataflow offered a speedup of 3× relative to the OS dataflow, and the hardware im2col operation offered a speedup of 1.1× relative to the same operation on the CPU. In terms of hardware resources, increasing the array size by a factor of 2 increased both area and power by a factor of 3.3, and the im2col module increased area and power by factors of 1.01 and 1.06, respectively.
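The abstract refers to the image-to-column (im2col) operation, which lets a convolution be computed as a GEMM. The following is a minimal pure-Python sketch of that idea, assuming a single-channel 2-D input, stride 1, and no padding. It is not Gemmini's implementation, and the function names (`im2col`, `gemm`, `conv2d_direct`, `conv2d_im2col`) are hypothetical names chosen for illustration.

```python
# Illustrative sketch only: im2col followed by GEMM, checked against a direct
# sliding-window convolution. Assumes one channel, stride 1, no padding.

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def gemm(a, b):
    """Naive matrix multiplication C = A x B on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][c] for t in range(inner)) for c in range(cols)]
            for row in a]


def conv2d_direct(image, kernel):
    """Reference convolution (cross-correlation, as is usual in DL)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]


def conv2d_im2col(image, kernel):
    """Same convolution expressed as im2col followed by one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)              # (out_h*out_w) x (kh*kw)
    weights = [[kernel[di][dj]]                  # (kh*kw) x 1 column vector
               for di in range(kh) for dj in range(kw)]
    flat = [row[0] for row in gemm(patches, weights)]
    # Fold the flat GEMM result back into the 2-D output shape.
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]

print(conv2d_im2col(image, kernel) == conv2d_direct(image, kernel))
```

The patch matrix repeats overlapping input pixels, so it is larger than the input image. That extra data movement is one reason where the unrolling runs (CPU or a dedicated hardware module) can matter for performance.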

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e17/10007457/04adcb1797f6/sensors-23-02380-g001a.jpg