Towards Understanding Convergence and Generalization of AdamW.

Authors

Zhou Pan, Xie Xingyu, Lin Zhouchen, Yan Shuicheng

Publication

IEEE Trans Pattern Anal Mach Intell. 2024 Sep;46(9):6486-6493. doi: 10.1109/TPAMI.2024.3382294. Epub 2024 Aug 6.

Abstract

AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not enter the gradient-based update itself, which distinguishes it from the widely used ℓ2-regularizer that changes the optimization steps by altering the first- and second-order gradient moments. Despite its great practical success, the convergence behavior of AdamW and its generalization improvement over Adam and ℓ2-regularized Adam (ℓ2-Adam) have remained unexplained. To address this issue, we prove the convergence of AdamW and justify its generalization advantages over Adam and ℓ2-Adam. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, and thus behaves differently from Adam and ℓ2-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and ℓ2-Adam and improves their previously known complexity, especially for over-parametrized networks. Besides, we prove that AdamW enjoys smaller generalization errors than Adam and ℓ2-Adam from the Bayesian posterior perspective. This result, for the first time, explicitly reveals the benefits of the decoupled weight decay in AdamW. Experimental results validate our theory.
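
To make the contrast in the abstract concrete, below is a minimal NumPy sketch of one update step of ℓ2-Adam versus AdamW. The function names, default hyperparameters, and the exact placement of the decay factor lam are illustrative assumptions rather than the paper's formulation; the point is only that ℓ2-Adam folds the term λθ into the gradient, so the decay is rescaled by the adaptive moments, while AdamW applies the decay directly to the weights, outside the adaptive scaling.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=1e-2):
    """One step of l2-regularized Adam: the weight-decay term enters the
    gradient, so it also enters the first- and second-order moments."""
    g = grad + lam * theta                     # l2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * g            # first moment of the regularized gradient
    v = beta2 * v + (1 - beta2) * g**2         # second moment of the regularized gradient
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One step of AdamW: the moments see only the raw gradient; the decay
    is applied directly to the weights and is not adaptively rescaled."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta (illustrative only).
theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, grad=theta, m=m, v=v, t=t)
print(theta)
```

In the ℓ2 variant the decay term is divided by the per-coordinate factor √v̂ + ε, so weights with large historical gradients are decayed less; in the AdamW variant every weight is decayed at the same rate, which is the "decoupling" whose effect the paper analyzes.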

