IEEE Trans Neural Netw Learn Syst. 2018 May;29(5):1441-1453. doi: 10.1109/TNNLS.2017.2665555. Epub 2017 Mar 8.
Despite decades of research on, and commercial availability of, high-performance general-purpose processors, many applications still require fully customized hardware architectures for further computational acceleration. Deep learning has recently been applied successfully across a wide variety of applications, but its heavy computational demand has considerably limited its practical use. This paper proposes a fully pipelined acceleration architecture to alleviate the high computational demand of an artificial neural network (ANN), specifically a restricted Boltzmann machine (RBM) ANN. The implemented RBM ANN accelerator (at the integrated network size, using 128 input cases per batch, and running at a 303-MHz clock frequency), integrated in a state-of-the-art field-programmable gate array (FPGA) (Xilinx Virtex 7 XC7V-2000T), provides a computational performance of 301 billion connection-updates-per-second, about 193 times higher than a software solution running on general-purpose processors. Most importantly, the architecture achieves over 4 times (12 times in batch learning) higher performance than a previous work when both are implemented in the same FPGA device (XC2VP70).
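To make the connection-updates-per-second (CUPS) figure concrete, the sketch below runs one step of contrastive divergence (CD-1), the standard RBM batch learning rule, and counts the connection updates it performs. The layer sizes (256 visible, 256 hidden) are hypothetical placeholders, since the abstract does not state the network dimensions; only the batch size of 128 comes from the text. This is a minimal NumPy illustration of the computation being accelerated, not the paper's hardware algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes (not stated in the abstract); 128 cases per batch is from the text.
n_visible, n_hidden, batch = 256, 256, 128

W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))   # connection weights
v0 = (rng.random((batch, n_visible)) < 0.5).astype(float)  # a batch of binary inputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One CD-1 step: positive phase, sampled hidden states, reconstruction, negative phase.
h0 = sigmoid(v0 @ W)                                   # positive-phase hidden probabilities
h0_sample = (rng.random(h0.shape) < h0).astype(float)  # stochastic hidden states
v1 = sigmoid(h0_sample @ W.T)                          # visible reconstruction
h1 = sigmoid(v1 @ W)                                   # negative-phase hidden probabilities

lr = 0.01
W += lr * (v0.T @ h0 - v1.T @ h1) / batch              # batch weight (connection) update

# Every connection is updated once per input case per CD-1 step:
connection_updates = n_visible * n_hidden * batch
print(connection_updates)  # 8388608 connection updates for this one batch step
```

At the reported 301 billion CUPS, an accelerator would sustain on the order of 301e9 / 8.4e6 ≈ 36,000 such batch steps per second for this (assumed) network size; the dominant cost is the three dense matrix products, which is what a fully pipelined FPGA datapath parallelizes.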