Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman Avestimehr
Department of Electrical and Computer Engineering, University of Southern California.
Department of Computer Science and Engineering, Inha University.
Transact Mach Learn Res. 2023 Aug.
Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute costs of Hessian-related computations and instability issues in stochastic training. A well-known method, L-BFGS, which efficiently approximates the Hessian using the history of parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update, greatly reducing stochastic noise in the Hessian approximation and thereby stabilizing convergence during stochastic optimization. For model training at large scale, mL-BFGS approximates a block-wise Hessian, which enables distributing the compute and memory costs across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models with mL-BFGS and compare its performance against baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves noticeable speedups both iteration-wise and in wall-clock time.
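To make the idea of a momentum-smoothed L-BFGS update concrete, the sketch below shows a minimal single-node version, not the authors' implementation: the stochastic curvature pairs (s, y) are smoothed with an exponential moving average before entering the standard L-BFGS two-loop recursion. The class name, the momentum coefficient `mu`, the `damping` threshold, and the history size `m` are illustrative assumptions; the paper's block-wise, distributed variant is not shown.

```python
# Minimal sketch of a momentum-smoothed L-BFGS update (assumptions noted above).
import numpy as np
from collections import deque

class MomentumLBFGS:
    def __init__(self, m=10, mu=0.9, damping=1e-4):
        self.history = deque(maxlen=m)   # m most recent smoothed (s, y) pairs
        self.mu = mu                     # momentum coefficient (assumed value)
        self.damping = damping           # curvature threshold to keep y^T s > 0
        self.s_bar = None                # smoothed parameter change
        self.y_bar = None                # smoothed gradient change

    def update_history(self, s, y):
        # Exponential-moving-average smoothing of the noisy stochastic
        # (s, y) pair before it enters the Hessian approximation.
        if self.s_bar is None:
            self.s_bar, self.y_bar = s.copy(), y.copy()
        else:
            self.s_bar = self.mu * self.s_bar + (1 - self.mu) * s
            self.y_bar = self.mu * self.y_bar + (1 - self.mu) * y
        # Only store pairs with sufficient positive curvature.
        if self.y_bar @ self.s_bar > self.damping * (self.s_bar @ self.s_bar):
            self.history.append((self.s_bar.copy(), self.y_bar.copy()))

    def direction(self, grad):
        # Standard L-BFGS two-loop recursion over the smoothed history.
        q = grad.copy()
        alphas = []
        for s, y in reversed(self.history):        # newest to oldest
            rho = 1.0 / (y @ s)
            a = rho * (s @ q)
            alphas.append((rho, a, s, y))
            q -= a * y
        if self.history:
            s, y = self.history[-1]
            q *= (s @ y) / (y @ y)                 # initial Hessian scaling
        for rho, a, s, y in reversed(alphas):      # oldest to newest
            b = rho * (y @ q)
            q += (a - b) * s
        return -q                                  # preconditioned descent direction
```

In a training loop, one would compute s as the parameter change and y as the gradient change at each step, call `update_history(s, y)`, and then step along `direction(grad)` scaled by a learning rate.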