Jia Xixi, Feng Xiangchu, Yong Hongwei, Meng Deyu
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6936-6947. doi: 10.1109/TNNLS.2022.3213536. Epub 2024 May 2.
Weight decay (WD) is a fundamental and practical regularization technique for improving the generalization of current deep learning models. However, it has been observed that WD does not work as effectively for adaptive optimization algorithms (such as Adam) as it does for SGD. Specifically, the solution found by Adam with WD often generalizes unsatisfactorily. Though efforts have been made to mitigate this issue, the reason for the deficiency remains vague. In this article, we first show that when using the Adam optimizer, the weight norm increases very fast along the training procedure, in contrast to SGD, where the weight norm increases relatively slowly and tends to converge. The fast increase of the weight norm is adverse to WD; in consequence, the Adam optimizer loses efficacy in finding solutions that generalize well. To resolve this problem, we propose to tailor Adam by introducing a regularization term on the adaptive learning rate, such that it is friendly to WD. Meanwhile, we introduce a first moment on the WD term to further enhance the regularization effect. We show that the proposed method is able to find solutions with small norm that generalize better than those of SGD. We test the proposed method on general image classification and fine-grained image classification tasks with different networks. Experimental results in all these cases substantiate the effectiveness of the proposed method in improving generalization. Specifically, the proposed method improves the test accuracy of Adam by a large margin and even improves the performance of SGD, by 0.84% on CIFAR-10 and 1.03% on CIFAR-100 with ResNet-50. The code for this article is publicly available at xxx.
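The abstract does not give the paper's exact update rule, but the two ingredients it names can be sketched in a generic form: Adam with decoupled weight decay, where the decay term is additionally smoothed by its own first moment (an exponential moving average). The function name `adam_wd_step`, the coefficient `beta_wd`, and the specific way the decay moment enters the update are assumptions made for illustration, not the authors' formulation; the proposed regularization on the adaptive learning rate is likewise omitted here.

```python
import numpy as np

def adam_wd_step(w, g, m, v, m_wd, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2, beta_wd=0.9):
    """One hypothetical optimizer step (illustrative sketch only):
    Adam with decoupled weight decay, where the decay term itself is
    tracked by a first moment (EMA), loosely following the abstract."""
    # standard Adam moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # first moment on the weight-decay term (assumed form)
    m_wd = beta_wd * m_wd + (1 - beta_wd) * wd * w
    # decoupled decay: the smoothed WD term bypasses the adaptive denominator
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + m_wd)
    return w, m, v, m_wd
```

A minimal usage example, minimizing the quadratic f(w) = 0.5·||w||², whose gradient is simply w; the state tensors start at zero and the step counter `t` starts at 1 for bias correction:

```python
w = np.ones(3)
m = v = m_wd = np.zeros(3)
for t in range(1, 101):
    w, m, v, m_wd = adam_wd_step(w, g=w, m=m, v=v, m_wd=m_wd, t=t, lr=0.1)
```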