IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):404-415. doi: 10.1109/TPAMI.2020.3004354. Epub 2021 Dec 7.
Large-scale distributed training of deep neural networks produces models with worse generalization performance, owing to the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. We propose scalable and practical natural gradient descent (SP-NGD), a principled approach for training models that allows them to attain similar generalization performance to models trained with first-order optimization methods, but with accelerated convergence. Furthermore, SP-NGD scales to large mini-batch sizes with negligible computational overhead compared to first-order methods. We evaluated SP-NGD on a benchmark task where highly optimized first-order methods are available as references: training a ResNet-50 model for image classification on ImageNet. We demonstrate convergence to a top-1 validation accuracy of 75.4 percent in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9 percent with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
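To make the core idea concrete, the following is a minimal sketch of natural gradient descent on a toy logistic-regression problem, not the paper's SP-NGD (which uses structured Fisher approximations and distributed training). The gradient is preconditioned by a damped empirical Fisher matrix; all names, the toy model, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy separable classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                     # inputs
y = (X @ rng.normal(size=5) > 0).astype(float)    # binary labels
w = np.zeros(5)                                   # model weights
damping, lr = 1e-3, 0.5                           # Tikhonov damping, step size

for step in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))            # sigmoid predictions
    g_per_ex = (p - y)[:, None] * X               # per-example log-loss gradients
    g = g_per_ex.mean(axis=0)                     # mini-batch gradient
    # Empirical Fisher: average outer product of per-example gradients.
    F = g_per_ex.T @ g_per_ex / len(X)
    # Natural gradient step: w <- w - lr * (F + damping * I)^{-1} g
    w -= lr * np.linalg.solve(F + damping * np.eye(5), g)
```

Preconditioning by the (damped) Fisher rescales the update according to the local curvature of the loss in distribution space, which is what allows natural-gradient methods to take much larger effective steps than plain SGD at large batch sizes.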