An ASIP for Neural Network Inference on Embedded Devices with 99% PE Utilization and 100% Memory Hidden under Low Silicon Cost.

Affiliation

School of Information and Electronics, Beijing Institute of Technology, Beijing 100811, China.

Publication

Sensors (Basel). 2022 May 19;22(10):3841. doi: 10.3390/s22103841.

Abstract

The computation efficiency and flexibility of accelerators hinder deep neural network (DNN) deployment in embedded applications. Although many DNN processors have been published, there is still substantial room for deeper optimization. Multiple dimensions must be considered simultaneously to reach the performance limit of an architecture, including the architecture decision, flexibility, energy efficiency, and silicon cost minimization. Flexibility is defined as the ability to support as many networks as possible and to scale easily. For energy efficiency, large gains remain available from minimizing data accesses and memory latency on top of a minimized on-chip memory. This work therefore focused on low-power, low-latency data access at minimal silicon cost. The design was implemented as an ASIP (application-specific instruction set processor) whose ISA is based on the Caffe2 inference operators and whose hardware follows a single instruction multiple data (SIMD) architecture. The scalability and system performance of our SoC extension scheme were demonstrated. VLIW (very long instruction word) execution is used to issue multiple instructions in parallel, eliminating all data access time for the convolution layers. Finally, the processor was synthesized in TSMC 65 nm technology with a 200 MHz clock, and the SoC extension scheme was analyzed with an experimental model. Our design was tested on several typical neural networks, achieving 196 GOPS at 200 MHz and 241 GOPS/W on VGG16Net and AlexNet.
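
The abstract does not spell out the latency-hiding mechanism, so the following is a minimal sketch, not the authors' implementation, of how a double-buffered (ping-pong) schedule can overlap tile loads with MAC computation so that the PEs never stall on memory, which is the effect the parallel VLIW issue is claimed to achieve for convolution layers. The names and tile scheme here (run_conv_layer, load_tile, mac_tile, two on-chip buffers) are assumptions; the only numbers taken from the abstract are 196 GOPS at 200 MHz and 241 GOPS/W.

```python
# Hypothetical sketch of double-buffered ("ping-pong") scheduling that hides
# memory access behind compute, as the abstract claims for convolution layers.
# Tile sizes, helper names, and the two-buffer scheme are assumptions, not
# details taken from the paper.

def run_conv_layer(num_tiles, load_tile, mac_tile):
    """Process `num_tiles` tiles; while tile i is being computed,
    tile i+1 is being fetched, so load time never stalls the PEs."""
    buffers = [None, None]                   # two on-chip buffers (ping-pong)
    buffers[0] = load_tile(0)                # prologue: fetch the first tile
    for i in range(num_tiles):
        nxt = (i + 1) % 2
        if i + 1 < num_tiles:
            buffers[nxt] = load_tile(i + 1)  # in hardware this load occupies a
                                             # separate slot of the same VLIW
                                             # word as the compute below
        mac_tile(buffers[i % 2])             # PEs stay busy every cycle
    # In this Python sketch the load and compute run sequentially; on the
    # processor they are issued in parallel, so the load latency is overlapped.

# Back-of-the-envelope numbers from the abstract:
ops_per_cycle = 196e9 / 200e6                # 196 GOPS at 200 MHz -> ~980 ops/cycle
power_watts   = 196 / 241                    # 196 GOPS / 241 GOPS/W -> ~0.81 W
print(f"~{ops_per_cycle:.0f} ops/cycle, ~{power_watts:.2f} W")
```

Dividing 196 GOPS by 200 MHz gives roughly 980 operations per cycle, or about 490 multiply-accumulates per cycle if each MAC is counted as two operations, and dividing 196 GOPS by 241 GOPS/W gives roughly 0.81 W; these are the simple consistency checks the sketch prints.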

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/930b/9146143/28f7cdbfa490/sensors-22-03841-g001.jpg
