Towards Understanding Convergence and Generalization of AdamW.

Authors

Zhou Pan, Xie Xingyu, Lin Zhouchen, Yan Shuicheng

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Sep;46(9):6486-6493. doi: 10.1109/TPAMI.2024.3382294. Epub 2024 Aug 6.

DOI:10.1109/TPAMI.2024.3382294
PMID:38536692
Abstract

AdamW modifies Adam by adding a decoupled weight decay that shrinks the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not affect the optimization steps themselves, and so differs from the widely used ℓ2-regularizer, which changes the optimization steps by altering the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and ℓ2-regularized Adam (ℓ2-Adam) have lacked theoretical analysis. To address this gap, we prove the convergence of AdamW and justify its generalization advantages over Adam and ℓ2-Adam. Specifically, AdamW provably converges, but it minimizes a dynamically regularized loss that combines the vanilla loss with a dynamical regularization induced by the decoupled weight decay, and thus behaves differently from Adam and ℓ2-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW for finding a stationary point. This complexity also applies to Adam and ℓ2-Adam and improves on their previously known complexity, especially for over-parametrized networks. In addition, we prove from the Bayesian posterior perspective that AdamW enjoys smaller generalization errors than Adam and ℓ2-Adam. This result explicitly reveals, for the first time, the benefits of decoupled weight decay in AdamW. Experimental results validate our theory.
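
The distinction the abstract draws between decoupled weight decay and an ℓ2-regularizer can be illustrated with a minimal sketch. It assumes the commonly used form of the AdamW update on a single scalar weight, which may differ in detail from the exact formulation analyzed in the paper. The function name `adam_step` and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam-style update for a single scalar weight (illustrative sketch)."""
    if not decoupled:
        # l2-regularized Adam: the decay term is folded into the gradient,
        # so it changes both the first and the second moment estimates.
        grad = grad + weight_decay * w
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW-style decay: shrink the weight directly, outside the
        # adaptive step, leaving the moment estimates untouched.
        new_w -= lr * weight_decay * w
    return new_w, m, v


# With a zero loss gradient, only the weight-decay mechanism moves the weight.
for decoupled in (True, False):
    w, m, v = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                        weight_decay=0.01, decoupled=decoupled)
    print("decoupled" if decoupled else "l2", w, m, v)
```

In this toy setting the decoupled variant leaves both moments at zero and shrinks the weight by the factor `1 - lr * weight_decay`. The ℓ2 variant pushes the decay term through the adaptive normalization, so the step is close to `lr` in size regardless of how small `weight_decay` is.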


Similar articles

1
Towards Understanding Convergence and Generalization of AdamW.
IEEE Trans Pattern Anal Mach Intell. 2024 Sep;46(9):6486-6493. doi: 10.1109/TPAMI.2024.3382294. Epub 2024 Aug 6.
2
Weight Decay With Tailored Adam on Scale-Invariant Weights for Better Generalization.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6936-6947. doi: 10.1109/TNNLS.2022.3213536. Epub 2024 May 2.
3
XGrad: Boosting Gradient-Based Optimizers With Weight Prediction.
IEEE Trans Pattern Anal Mach Intell. 2024 Oct;46(10):6731-6747. doi: 10.1109/TPAMI.2024.3387399. Epub 2024 Sep 5.
4
UAdam: Unified Adam-Type Algorithmic Framework for Nonconvex Optimization.
Neural Comput. 2024 Aug 19;36(9):1912-1938. doi: 10.1162/neco_a_01692.
5
AdaCN: An Adaptive Cubic Newton Method for Nonconvex Stochastic Optimization.
Comput Intell Neurosci. 2021 Nov 10;2021:5790608. doi: 10.1155/2021/5790608. eCollection 2021.
6
Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM.
Neurocomputing (Amst). 2022 Apr 7;481:333-356. doi: 10.1016/j.neucom.2022.01.014. Epub 2022 Jan 21.
7
Convergence of the RMSProp deep learning method with penalty for nonconvex optimization.
Neural Netw. 2021 Jul;139:17-23. doi: 10.1016/j.neunet.2021.02.011. Epub 2021 Feb 23.
8
The WuC-Adam algorithm based on joint improvement of Warmup and cosine annealing algorithms.
Math Biosci Eng. 2024 Jan;21(1):1270-1285. doi: 10.3934/mbe.2024054. Epub 2022 Dec 26.
9
Convergence analysis of AdaBound with relaxed bound functions for non-convex optimization.
Neural Netw. 2022 Jan;145:300-307. doi: 10.1016/j.neunet.2021.10.026. Epub 2021 Nov 8.
10
A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization.
IEEE Trans Pattern Anal Mach Intell. 2021 Jun 8;PP. doi: 10.1109/TPAMI.2021.3087328.

Cited by

1
Prediction of the ectasia screening index from raw Casia2 volume data for keratoconus identification by using convolutional neural networks.
PLoS One. 2025 Sep 2;20(9):e0311036. doi: 10.1371/journal.pone.0311036. eCollection 2025.
2
Prediction of functional outcomes in aneurysmal subarachnoid hemorrhage using pre-/postoperative noncontrast CT within 3 days of admission.
NPJ Digit Med. 2025 Aug 24;8(1):542. doi: 10.1038/s41746-025-01953-z.
3
A truth inference scheme for crowdsourcing using NLP and swin transformers.
Sci Rep. 2025 Aug 4;15(1):28338. doi: 10.1038/s41598-025-10942-x.
4
Equity-enhanced glaucoma progression prediction from OCT with knowledge distillation.
NPJ Digit Med. 2025 Jul 24;8(1):477. doi: 10.1038/s41746-025-01884-9.
5
HiC4D-SPOT: a spatiotemporal outlier detection tool for Hi-C data.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf341.
6
Lightweight Brain Tumor Segmentation Through Wavelet-Guided Iterative Axial Factorization Attention.
Brain Sci. 2025 Jun 6;15(6):613. doi: 10.3390/brainsci15060613.
7
Deep Learning-Based Ground-Penetrating Radar Inversion for Tree Roots in Heterogeneous Soil.
Sensors (Basel). 2025 Feb 5;25(3):947. doi: 10.3390/s25030947.
8
Motion Hologram: Jointly optimized hologram generation and motion planning for photorealistic 3D displays via reinforcement learning.
Sci Adv. 2025 Jan 31;11(5):eads9876. doi: 10.1126/sciadv.ads9876. Epub 2025 Jan 29.
9
scHiGex: predicting single-cell gene expression based on single-cell Hi-C data.
NAR Genom Bioinform. 2025 Jan 27;7(1):lqaf002. doi: 10.1093/nargab/lqaf002. eCollection 2025 Mar.
10
GenoSiS: A Biobank-Scale Genotype Similarity Search Architecture for Creating Dynamic Patient-Match Cohorts.
bioRxiv. 2024 Nov 3:2024.11.02.621671. doi: 10.1101/2024.11.02.621671.