Towards Understanding Convergence and Generalization of AdamW

Authors

Zhou Pan, Xie Xingyu, Lin Zhouchen, Yan Shuicheng

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Sep;46(9):6486-6493. doi: 10.1109/TPAMI.2024.3382294. Epub 2024 Aug 6.

DOI:10.1109/TPAMI.2024.3382294

Abstract

AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not affect the specific optimization steps, and so differs from the widely used ℓ2-regularizer, which changes the optimization steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and ℓ2-regularized Adam (ℓ2-Adam) have not yet been established. To address this, we prove the convergence of AdamW and justify its generalization advantages over Adam and ℓ2-Adam. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, and thus behaves differently from Adam and ℓ2-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and ℓ2-Adam and improves on their previously known complexity, especially for over-parametrized networks. In addition, we prove from the Bayesian posterior perspective that AdamW enjoys smaller generalization errors than Adam and ℓ2-Adam. This result, for the first time, explicitly reveals the benefits of decoupled weight decay in AdamW. Experimental results validate our theory.
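To make the distinction between decoupled weight decay and an ℓ2 penalty concrete, the following is a minimal sketch of a single scalar update step. It follows the commonly used textbook form of Adam and AdamW and is not taken from the paper, whose exact algorithm and notation may differ. The function name `adam_step` and the `decoupled` flag are hypothetical names chosen for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One scalar Adam-style update (illustrative sketch, not the paper's code).

    w: weight, g: loss gradient at w, m/v: first/second moment estimates,
    t: 1-based step counter. Returns the updated (w, m, v).
    """
    if not decoupled:
        # l2-regularized Adam: the penalty gradient is folded into g,
        # so it changes both moment estimates.
        g = g + weight_decay * w

    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # AdamW: the decay acts on the weight directly, outside the
        # adaptive scaling, and leaves the moments untouched.
        w_new = w_new - lr * weight_decay * w

    return w_new, m, v


# Assumed toy setting: zero loss gradient, so only the weight decay acts.
w_adamw, m_adamw, v_adamw = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                                      weight_decay=0.1, decoupled=True)
w_l2, m_l2, v_l2 = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                             weight_decay=0.1, decoupled=False)
```

In this toy setting the decoupled variant leaves both moment estimates at zero and shrinks the weight by `lr * weight_decay * w`, whereas the ℓ2 variant feeds the penalty through the moments, so its first step has a size close to `lr` regardless of the decay coefficient.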

