
SCA: Search-Based Computing Hardware Architecture with Precision Scalable and Computation Reconfigurable Scheme.

Affiliation

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.

Publication

Sensors (Basel). 2022 Nov 6;22(21):8545. doi: 10.3390/s22218545.

Abstract

Deep neural networks have been deployed on various hardware accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC) chips. Normally, a huge amount of computation is required in the inference process, creating significant logic resource overheads. In addition, frequent data accesses between off-chip memory and hardware accelerators create bottlenecks, leading to a decline in hardware efficiency. Many solutions have been proposed to reduce hardware overhead and data movement. For example, lookup-table (LUT)-based hardware architectures can be used to mitigate computing operation demands. However, typical LUT-based accelerators suffer from limited computational precision and poor scalability. In this paper, we propose a search-based computing scheme built on an LUT solution, which improves computation efficiency by replacing traditional multiplication with a search operation. In addition, the proposed scheme supports multiple bit widths at different precisions to meet the needs of different DNN-based applications. We design a reconfigurable computing strategy that efficiently adapts to convolutions of different kernel sizes, improving hardware scalability. We implement a search-based architecture, namely SCA, which adopts an on-chip storage mechanism, thus greatly reducing interactions with off-chip memory and alleviating bandwidth pressure. Based on experimental evaluation, the proposed SCA architecture achieves 92%, 96% and 98% computational utilization at computational precisions of 4 bits, 8 bits and 16 bits, respectively. Compared with the state-of-the-art LUT-based architecture, efficiency is improved four-fold.
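To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's actual SCA hardware design) of how an LUT-based accelerator replaces multiplication with a lookup: for low-bit operands, all possible products are precomputed once, so each "multiply" in a convolution reduces to a single table search. The function and variable names here are invented for illustration.

```python
# Illustrative sketch of LUT-based multiplication, the principle behind
# LUT-based accelerators: precompute all products of low-bit operands
# once, then replace every runtime multiply with a table lookup.

BITS = 4              # 4-bit unsigned operands, as in the lowest precision mode
SIZE = 1 << BITS      # 16 possible values per operand

# Precompute the product table: table[a][b] == a * b
table = [[a * b for b in range(SIZE)] for a in range(SIZE)]

def lut_mul(a: int, b: int) -> int:
    """Return a * b via a table search instead of a hardware multiplier."""
    return table[a][b]

def lut_dot(xs, ws):
    """A convolution-style dot product built only from lookups and adds."""
    return sum(lut_mul(x, w) for x, w in zip(xs, ws))

print(lut_mul(7, 9))                   # 63
print(lut_dot([1, 2, 3], [4, 5, 6]))   # 4 + 10 + 18 = 32
```

In hardware, the table would live in on-chip storage shared across processing elements; wider precisions (8-bit, 16-bit) can be composed from low-bit partial products, which is one way such schemes scale precision without a full-size table.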


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e92b/9658340/d759b56025e7/sensors-22-08545-g001.jpg
