Shin Dongyeob, Kim Geonho, Jo Joongho, Park Jongsun
IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5745-5759. doi: 10.1109/TNNLS.2021.3130991. Epub 2023 Sep 1.
Deep neural network (DNN) training is an iterative process of updating network weights through gradient computation, for which the (mini-batch) stochastic gradient descent (SGD) algorithm is generally used. Since SGD inherently tolerates noise in its gradient computations, properly approximating the weight gradients within the SGD noise margin is a promising way to save energy and time during DNN training. This article proposes two novel techniques that reduce the computational complexity of gradient computation and thereby accelerate SGD-based DNN training. First, since the output predictions of a network (confidence) change with the training inputs, the relation between the confidence and the magnitude of the weight gradient can be exploited to skip gradient computations without seriously sacrificing accuracy, especially for high-confidence inputs. Second, angle diversity-based approximations of the intermediate activations used in weight gradient calculation are presented. Based on the fact that the angle diversity of gradients is small (highly uncorrelated) in the early training epochs, the bit precision of the activations can be reduced to 2, 4, or 8 bits depending on the resulting angle error between the original gradient and the quantized gradient. Simulations show that the proposed approach can skip up to 75.83% of gradient computations with negligible accuracy degradation on the CIFAR-10 dataset using ResNet-20. Hardware implementation results in 65-nm CMOS technology also show that the proposed training accelerator achieves up to 1.69× higher energy efficiency compared with other training accelerators.
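To make the two approximations concrete, the following is a minimal NumPy sketch for a single fully connected layer, written under stated assumptions rather than as the paper's exact method: the names conf_threshold and angle_tol, the uniform symmetric quantizer, and the per-sample skipping rule are illustrative choices. The sketch computes the full-precision gradient only to evaluate the angle-error criterion; in the actual accelerator, the bit width would be selected without that reference computation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def quantize(x, bits):
    # Uniform symmetric quantization of activations to the given bit width.
    scale = np.abs(x).max() + 1e-12
    levels = 2 ** (bits - 1) - 1
    return np.round(x / scale * levels) / levels * scale

def angle_error(g_ref, g_q):
    # Angle (radians) between the reference and quantized weight gradients.
    cos = np.dot(g_ref.ravel(), g_q.ravel()) / (
        np.linalg.norm(g_ref) * np.linalg.norm(g_q) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def weight_gradient(acts, grad_out, conf,
                    conf_threshold=0.95, angle_tol=0.1):
    """Sketch of both approximations for one linear layer.

    acts:     (B, D) input activations of the layer
    grad_out: (B, C) gradient of the loss w.r.t. the layer output
    conf:     (B, C) softmax outputs of the network
    conf_threshold, angle_tol: hypothetical tuning parameters
    """
    # 1) Confidence-based skipping: drop samples whose top-1 confidence
    #    already exceeds the threshold, so their gradient work is saved.
    keep = conf.max(axis=1) < conf_threshold
    acts, grad_out = acts[keep], grad_out[keep]
    if acts.shape[0] == 0:
        return None  # entire mini-batch skipped

    g_full = acts.T @ grad_out  # full-precision reference gradient (D, C)

    # 2) Angle-error-based precision selection: use the lowest bit width
    #    (2/4/8) whose quantized-activation gradient stays within angle_tol.
    for bits in (2, 4, 8):
        g_q = quantize(acts, bits).T @ grad_out
        if angle_error(g_full, g_q) <= angle_tol:
            return g_q
    return g_full

if __name__ == "__main__":
    # Toy usage: random activations and a softmax cross-entropy output gradient.
    rng = np.random.default_rng(0)
    B, D, C = 32, 64, 10
    acts = rng.standard_normal((B, D))
    logits = rng.standard_normal((B, C))
    onehot = np.eye(C)[rng.integers(0, C, B)]
    grad_out = softmax(logits) - onehot
    g = weight_gradient(acts, grad_out, softmax(logits))
    print(None if g is None else g.shape)
```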