Chen Zhaodong, Deng Lei, Li Guoqi, Sun Jiawei, Hu Xing, Liang Ling, Ding Yufei, Xie Yuan
IEEE Trans Neural Netw Learn Syst. 2021 Jan;32(1):348-362. doi: 10.1109/TNNLS.2020.2978753. Epub 2021 Jan 4.
Deep neural networks (DNNs) have thrived in recent years, and batch normalization (BN) plays an indispensable role in their training. However, BN is costly because of its large reduction and elementwise operations, which are hard to parallelize and therefore slow down training considerably. To address this issue, we propose a methodology that alleviates BN's cost by using only a few sampled or generated data points for mean and variance estimation at each iteration. The key challenge is to strike a satisfactory balance between normalization effectiveness and execution efficiency: effectiveness demands less data correlation in sampling, whereas efficiency demands more regular execution patterns. To this end, we design two categories of approach, which either sample or create a few uncorrelated data for statistics estimation under certain strategy constraints. The former category includes "batch sampling (BS)," which randomly selects a few samples from each batch, and "feature sampling (FS)," which randomly selects a small patch from each feature map of all samples; the latter is "virtual data set normalization (VDN)," which generates a few synthetic random samples to directly create uncorrelated data for statistics estimation. Accordingly, multiway strategies are designed both to reduce data correlation for accurate estimation and to optimize the execution pattern for acceleration. The proposed methods are comprehensively evaluated on various DNN models, where the loss in model accuracy and convergence rate is negligible. Without the support of any specialized libraries, 1.98× BN-layer acceleration and a 23.2% overall training speedup are practically achieved on modern GPUs. Furthermore, our methods perform strongly on the well-known "micro-BN" problem that arises with tiny batch sizes.
This article provides a promising solution for the efficient training of high-performance DNNs.
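The "batch sampling (BS)" idea described above can be sketched as follows: BN statistics are estimated from a small random subset of the batch, but the resulting mean and variance normalize the full batch. This is a minimal illustrative sketch in NumPy, not the authors' implementation; the function name, default sample count, and epsilon are assumptions.

```python
import numpy as np

def sampled_batch_norm(x, num_samples=4, eps=1e-5, rng=None):
    """Illustrative 'batch sampling (BS)' variant of batch normalization.

    Instead of reducing over the entire batch, per-channel mean and
    variance are estimated from a few randomly chosen samples (as the
    paper proposes), which shrinks the costly reduction.

    x: activations of shape (N, C, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    # Randomly pick a few samples from the batch for statistics estimation.
    idx = rng.choice(n, size=min(num_samples, n), replace=False)
    subset = x[idx]                                    # (num_samples, C, H, W)
    mean = subset.mean(axis=(0, 2, 3), keepdims=True)  # per-channel mean
    var = subset.var(axis=(0, 2, 3), keepdims=True)    # per-channel variance
    # Normalize the FULL batch with the sampled statistics.
    return (x - mean) / np.sqrt(var + eps)
```

When `num_samples` equals the batch size, this reduces to standard (affine-free) BN; "feature sampling (FS)" would instead slice a small spatial patch from every sample's feature maps before the same reduction.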