Tang Yue, Song Yukai, Elango Naveena, Priya Sheena Ratnam, Jones Alex K, Xiong Jinjun, Zhou Peipei, Hu Jingtong
Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA.
IEEE Trans Comput Aided Des Integr Circuits Syst. 2024 Nov;43(11):3937-3948. doi: 10.1109/tcad.2024.3438994. Epub 2024 Nov 6.
DNNs are rapidly evolving from streamlined single-modality single-task (SMST) models to multi-modality multi-task (MMMT) models with large variations across layers and complex data dependencies among them. To support such models, hardware systems have also become heterogeneous, driven by the prevailing trend of integrating diverse accelerators into a system for lower latency. FPGAs offer high computation density and communication bandwidth and can be configured with different accelerator designs, so they are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging because MMMT models have much larger layer variations, a massive number of layers, and complex data dependencies among different backbones. Previous mapping algorithms are either inefficient or over-simplified, which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models on realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs onto the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerator-to-FPGA deployment approach that co-optimizes hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which supports general and practical cases that previous mapping algorithms cannot handle. To the best of our knowledge, this is the first attempt to implement MMMT models on real heterogeneous FPGA clusters. Experimental results show that the latency obtained with CHEF is near-optimal, while the search time is 10,000X shorter than exhaustively searching for the optimal solution.
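To make the M2A problem concrete, the sketch below shows a minimal greedy list-scheduling baseline for assigning DNN layers to heterogeneous accelerators. This is not the CHEF-M2A algorithm from the paper; the function names, latency numbers, and the flat communication penalty are illustrative assumptions only, chosen to show the kind of dependency-aware placement decision the abstract describes.

```python
# Hypothetical illustration of the layer-to-accelerator mapping (M2A) problem.
# This greedy list-scheduling sketch is NOT the CHEF-M2A algorithm; all names,
# latency numbers, and the communication model are assumptions.

from collections import defaultdict

def greedy_m2a(layers, deps, lat, comm_cost):
    """Assign each layer (in topological order) to the accelerator that
    minimizes its finish time, given per-accelerator latency estimates
    and a fixed penalty when a producer layer sits on another accelerator."""
    accel_free = defaultdict(float)   # earliest time each accelerator is free
    finish = {}                       # finish time of each scheduled layer
    placement = {}                    # chosen accelerator per layer

    for layer in layers:              # layers assumed topologically sorted
        best = None
        for acc in lat[layer]:
            # inputs are ready once all producers finish (+ transfer if remote)
            ready = max(
                (finish[p] + (comm_cost if placement[p] != acc else 0.0)
                 for p in deps.get(layer, ())),
                default=0.0,
            )
            start = max(ready, accel_free[acc])
            end = start + lat[layer][acc]
            if best is None or end < best[0]:
                best = (end, acc)
        finish[layer], placement[layer] = best
        accel_free[best[1]] = best[0]
    return placement, max(finish.values())

# Toy example: two backbones feeding a fusion layer, two accelerators.
layers = ["conv_a", "conv_b", "fuse", "head"]
deps = {"fuse": ["conv_a", "conv_b"], "head": ["fuse"]}
lat = {
    "conv_a": {"acc0": 3.0, "acc1": 5.0},
    "conv_b": {"acc0": 4.0, "acc1": 2.0},
    "fuse":   {"acc0": 1.0, "acc1": 1.5},
    "head":   {"acc0": 2.0, "acc1": 2.0},
}
placement, total_latency = greedy_m2a(layers, deps, lat, comm_cost=0.5)
print(placement, total_latency)
```

Even this toy version exposes the core tension the paper targets: a backbone's layers prefer the accelerator that runs them fastest, but cross-accelerator data dependencies add transfer cost, so deployment (A2F) and mapping (M2A) decisions interact and benefit from co-optimization.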