Liu Chang, Zhang Xishan, Zhang Rui, Li Ling, Zhou Shiyi, Huang Di, Li Zhen, Du Zidong, Liu Shaoli, Chen Tianshi
IEEE Trans Image Process. 2022;31:7006-7019. doi: 10.1109/TIP.2022.3216776. Epub 2022 Nov 14.
Quantization is a promising technique for reducing the computation and storage costs of DNNs. Low-bit (≤8-bit) precision training remains an open problem due to the difficulty of gradient quantization. In this paper, we identify two long-standing misunderstandings about the bias of gradient quantization noise. First, the large bias of gradient quantization noise, rather than its variance, is the key factor in training accuracy loss. Second, the widely used stochastic rounding cannot, in practice, solve the training crash problem caused by gradient quantization bias. Moreover, we find that the asymmetric distribution of gradients causes a large bias in gradient quantization noise. Based on these findings, we propose a novel adaptive piecewise quantization method that effectively limits the bias of gradient quantization noise. Accordingly, we propose a new data format, Piecewise Fixed Point (PWF), to represent quantized data. We apply our method to applications including image classification, machine translation, optical character recognition, and text classification, achieving an approximately 1.9∼3.5× speedup over full-precision training with an accuracy loss of less than 0.5%. To the best of our knowledge, this is the first work to quantize the gradients of all layers to 8 bits in both large-scale CNN and RNN training with negligible accuracy loss.
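To make the bias argument concrete, here is a minimal NumPy sketch. It is not the paper's PWF format or its adaptive quantizer; it only illustrates, under assumed toy gradient statistics (an asymmetric near-zero bulk plus a sparse large-magnitude tail) and a hypothetical two-segment split, how a single-scale 8-bit quantizer can produce a large noise bias while a piecewise scheme keeps it small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gradient" tensor: an asymmetric bulk of tiny values plus a sparse
# tail of large-magnitude values (assumed statistics, not from the paper).
grads = np.concatenate([
    rng.normal(2e-4, 1e-4, 100_000),   # slightly positive, dense bulk
    rng.normal(5e-2, 1e-2, 1_000),     # rare large gradients set the range
])

def quantize_uniform(x, bits=8):
    """Single-scale nearest-rounding quantization over the full range.
    The step size is dictated by the largest magnitude, so the dense
    near-zero bulk mostly collapses to zero."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def quantize_piecewise(x, bits=8, split=None):
    """Illustrative two-segment quantizer: small and large magnitudes get
    separate scales, so the bulk keeps fine resolution. This is only a
    sketch of the piecewise idea, not the PWF format itself."""
    if split is None:
        split = np.quantile(np.abs(x), 0.99)   # hypothetical split point
    small = np.abs(x) <= split
    out = np.empty_like(x)
    out[small] = quantize_uniform(x[small], bits)
    out[~small] = quantize_uniform(x[~small], bits)
    return out

for name, q in [("uniform", quantize_uniform(grads)),
                ("piecewise", quantize_piecewise(grads))]:
    noise = q - grads
    print(f"{name:9s} bias={noise.mean():+.3e} var={noise.var():.3e}")
```

Running this, the single-scale quantizer shows a noise bias comparable in magnitude to the bulk gradients themselves, while the two-segment version drives the bias down by orders of magnitude, which is the qualitative effect the abstract attributes to piecewise quantization.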