CISUC, Department of Informatics Engineering, University of Coimbra, Portugal.
Int J Neural Syst. 2011 Feb;21(1):31-47. doi: 10.1142/S0129065711002638.
The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. Thus, it is an ideal candidate to accelerate a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become increasingly demanding, parallel implementations of learning algorithms are crucial for practical applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance as compared to the implementation on traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed, which aims at automatically finding high-quality NN-based solutions for a given problem.
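The data parallelism the abstract refers to can be illustrated with a minimal batch back-propagation sketch. This is not the paper's CUDA implementation; it is a hedged NumPy stand-in in which each matrix operation (forward activation, delta computation, weight update) corresponds to one of the data-parallel GPU kernels the paper describes. The XOR task, layer sizes, and learning rate are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy XOR problem: 4 patterns, 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
T = np.array([[0], [1], [1], [0]], dtype=np.float32)

# One hidden layer of 4 sigmoid units (sizes are arbitrary for this sketch).
W1 = rng.normal(0, 1, (2, 4)).astype(np.float32)
b1 = np.zeros(4, dtype=np.float32)
W2 = rng.normal(0, 1, (4, 1)).astype(np.float32)
b2 = np.zeros(1, dtype=np.float32)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for epoch in range(2000):
    # Forward pass: each layer's activation is one data-parallel kernel
    # launch on a GPU (all patterns in the batch processed at once).
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    losses.append(float(np.mean((Y - T) ** 2)))

    # Backward pass: output deltas, then hidden deltas via the
    # transposed weights (again one kernel per layer).
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)

    # Batch gradient update (one kernel per weight matrix).
    W2 -= lr * H.T @ dY
    b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH
    b1 -= lr * dH.sum(axis=0)
```

Because every step is expressed as a dense matrix operation over the whole batch, the same structure maps directly onto GPU kernels, which is where the throughput and memory-bandwidth gains reported in the abstract come from.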