

Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network.

Publication Information

IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5922-5934. doi: 10.1109/TNNLS.2018.2815085. Epub 2018 Apr 9.

DOI: 10.1109/TNNLS.2018.2815085
PMID: 29993989
Abstract

The deep convolutional neural network (DCNN) is a class of machine learning algorithms based on the feed-forward artificial neural network and is widely used for image processing applications. Implementation of DCNNs in real-world problems requires high computational power and high memory bandwidth in a power-constrained environment. A general-purpose CPU cannot exploit the different forms of parallelism offered by these algorithms and hence is slow and energy inefficient for practical use. We propose a field-programmable gate array (FPGA)-based runtime programmable coprocessor to accelerate feed-forward computation of DCNNs. The coprocessor can be programmed for a new network architecture at runtime without resynthesizing the FPGA hardware. Hence, it acts as a plug-and-use peripheral for the host computer. Caching is implemented for input features and filter weights using on-chip memory to reduce the external memory bandwidth requirement. Data are prefetched at several stages to avoid stalling of computational units, and different optimization techniques are used to efficiently reuse the fetched data. Dataflow is dynamically adjusted at runtime for each DCNN layer to achieve consistent computational throughput across a wide range of input feature sizes and filter sizes. The coprocessor is prototyped on the Xilinx Virtex-7 XC7VX485T FPGA-based VC707 board and operates at 150 MHz. Experimental results show that our implementation is more energy efficient than a highly optimized CPU implementation and achieves consistent computational throughput of more than 140 G operations/s for a wide range of input feature sizes and filter sizes. Off-chip memory transactions decrease due to the use of the on-chip cache.
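The bandwidth argument in the abstract can be made concrete with a rough traffic estimate. The sketch below compares external-memory traffic for one convolutional layer under a naive schedule (tensors re-fetched for each reuse) versus full on-chip caching (each tensor fetched once). The layer dimensions, reuse factors, and element width are illustrative assumptions for this sketch, not the paper's actual dataflow or measurements.

```python
def conv_dram_traffic_bytes(h, w, c_in, c_out, k, bytes_per_elem=2,
                            cache_inputs=False, cache_weights=False):
    """Rough external-memory traffic estimate for one conv layer.

    Without caching, a naive schedule re-reads the input feature map
    once per output channel and the filter weights once per output
    pixel; with on-chip caching, each tensor crosses the DRAM
    interface only once. These reuse factors are simplified for
    illustration.
    """
    input_elems = h * w * c_in
    weight_elems = k * k * c_in * c_out
    output_elems = h * w * c_out  # assumes 'same' padding, stride 1

    input_reads = input_elems * (1 if cache_inputs else c_out)
    weight_reads = weight_elems * (1 if cache_weights else h * w)
    return (input_reads + weight_reads + output_elems) * bytes_per_elem


# Hypothetical layer: 56x56x64 input, 3x3 filters, 128 output channels.
naive = conv_dram_traffic_bytes(56, 56, 64, 128, 3)
cached = conv_dram_traffic_bytes(56, 56, 64, 128, 3,
                                 cache_inputs=True, cache_weights=True)
print(f"naive:  {naive / 1e6:.1f} MB")
print(f"cached: {cached / 1e6:.1f} MB")
```

Even under these simplified assumptions, caching cuts DRAM traffic by two orders of magnitude for this layer, which is the motivation for spending on-chip memory on input-feature and weight caches.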


Similar Articles

1. Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network.
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5922-5934. doi: 10.1109/TNNLS.2018.2815085. Epub 2018 Apr 9.
2. Extending the BEAGLE library to a multi-FPGA platform.
BMC Bioinformatics. 2013 Jan 19;14:25. doi: 10.1186/1471-2105-14-25.
3. NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps.
IEEE Trans Neural Netw Learn Syst. 2019 Mar;30(3):644-656. doi: 10.1109/TNNLS.2018.2852335. Epub 2018 Jul 26.
4. Optimization of Deep Neural Networks Using SoCs with OpenCL.
Sensors (Basel). 2018 Apr 30;18(5):1384. doi: 10.3390/s18051384.
5. Field-programmable gate array implementation of a probabilistic neural network for motor cortical decoding in rats.
J Neurosci Methods. 2010 Jan 15;185(2):299-306. doi: 10.1016/j.jneumeth.2009.10.001. Epub 2009 Oct 29.
6. Efficient FPGA Implementation of Convolutional Neural Networks and Long Short-Term Memory for Radar Emitter Signal Recognition.
Sensors (Basel). 2024 Jan 30;24(3):889. doi: 10.3390/s24030889.
7. High-performance reconfigurable hardware architecture for restricted Boltzmann machines.
IEEE Trans Neural Netw. 2010 Nov;21(11):1780-92. doi: 10.1109/TNN.2010.2073481. Epub 2010 Sep 20.
8. Acceleration of Deep Neural Network Training Using Field Programmable Gate Arrays.
Comput Intell Neurosci. 2022 Oct 17;2022:8387364. doi: 10.1155/2022/8387364. eCollection 2022.
9. Feedforward neural network implementation in FPGA using layer multiplexing for effective resource utilization.
IEEE Trans Neural Netw. 2007 May;18(3):880-8. doi: 10.1109/TNN.2007.891626.
10. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
IEEE Trans Neural Netw Learn Syst. 2018 May;29(5):1441-1453. doi: 10.1109/TNNLS.2017.2665555. Epub 2017 Mar 8.

Cited By

1. Application of the artificial intelligence system based on graphics and vision in ethnic tourism of subtropical grasslands.
Heliyon. 2024 May 17;10(11):e31442. doi: 10.1016/j.heliyon.2024.e31442. eCollection 2024 Jun 15.
2. Open-Source FPGA Coprocessor for the Doppler Emulation of Moving Fluids.
Micromachines (Basel). 2021 Dec 12;12(12):1549. doi: 10.3390/mi12121549.