Tang Yue, Song Yukai, Elango Naveena, Priya Sheena Ratnam, Jones Alex K, Xiong Jinjun, Zhou Peipei, Hu Jingtong
Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14260, USA.
IEEE Trans Comput Aided Des Integr Circuits Syst. 2024 Nov;43(11):3937-3948. doi: 10.1109/tcad.2024.3438994. Epub 2024 Nov 6.
DNNs are rapidly evolving from streamlined single-modality single-task (SMST) models to multi-modality multi-task (MMMT) models with large variations across layers and complex data dependencies among them. To support such models, hardware systems have also become heterogeneous, driven by the prevailing trend of integrating diverse accelerators into a system for lower latency. FPGAs offer high computation density and communication bandwidth and can be configured with different accelerator designs, so they are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging because MMMT models have much larger layer variations, a massive number of layers, and complex data dependencies among different backbones. Previous mapping algorithms are either inefficient or over-simplified, which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models on realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs onto the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerator-to-FPGA deployment approach that co-optimizes hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which supports general and practical cases that previous mapping algorithms cannot handle. To the best of our knowledge, this is the first attempt to implement MMMT models on real heterogeneous FPGA clusters. Experimental results show that the latency obtained with CHEF is near-optimal, while the search time is 10,000X shorter than exhaustively searching for the optimal solution.
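To make the M2A problem concrete, the sketch below shows a minimal greedy list-scheduling baseline for assigning DNN layers to heterogeneous accelerators. This is not the CHEF-M2A algorithm from the paper; the function names, latency numbers, and the flat communication penalty are illustrative assumptions only, chosen to show the kind of dependency-aware placement decision the abstract describes.

```python
# Hypothetical illustration of the layer-to-accelerator mapping (M2A) problem.
# This greedy list-scheduling sketch is NOT the CHEF-M2A algorithm; all names,
# latency numbers, and the communication model are assumptions.

from collections import defaultdict

def greedy_m2a(layers, deps, lat, comm_cost):
    """Assign each layer (in topological order) to the accelerator that
    minimizes its finish time, given per-accelerator latency estimates
    and a fixed penalty when a producer layer sits on another accelerator."""
    accel_free = defaultdict(float)   # earliest time each accelerator is free
    finish = {}                       # finish time of each scheduled layer
    placement = {}                    # chosen accelerator per layer

    for layer in layers:              # layers assumed topologically sorted
        best = None
        for acc in lat[layer]:
            # inputs are ready once all producers finish (+ transfer if remote)
            ready = max(
                (finish[p] + (comm_cost if placement[p] != acc else 0.0)
                 for p in deps.get(layer, ())),
                default=0.0,
            )
            start = max(ready, accel_free[acc])
            end = start + lat[layer][acc]
            if best is None or end < best[0]:
                best = (end, acc)
        finish[layer], placement[layer] = best
        accel_free[best[1]] = best[0]
    return placement, max(finish.values())

# Toy example: two backbones feeding a fusion layer, two accelerators.
layers = ["conv_a", "conv_b", "fuse", "head"]
deps = {"fuse": ["conv_a", "conv_b"], "head": ["fuse"]}
lat = {
    "conv_a": {"acc0": 3.0, "acc1": 5.0},
    "conv_b": {"acc0": 4.0, "acc1": 2.0},
    "fuse":   {"acc0": 1.0, "acc1": 1.5},
    "head":   {"acc0": 2.0, "acc1": 2.0},
}
placement, total_latency = greedy_m2a(layers, deps, lat, comm_cost=0.5)
print(placement, total_latency)
```

Even this toy version exposes the core tension the paper targets: a backbone's layers prefer the accelerator that runs them fastest, but cross-accelerator data dependencies add transfer cost, so deployment (A2F) and mapping (M2A) decisions interact and benefit from co-optimization.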