Jia Xixi, Feng Xiangchu, Yong Hongwei, Meng Deyu
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6936-6947. doi: 10.1109/TNNLS.2022.3213536. Epub 2024 May 2.
Weight decay (WD) is a fundamental and practical regularization technique for improving the generalization of current deep learning models. However, it has been observed that WD does not work as effectively for adaptive optimization algorithms (such as Adam) as it does for SGD. Specifically, the solution found by Adam with WD often generalizes unsatisfactorily. Though efforts have been made to mitigate this issue, the reason for the deficiency remains vague. In this article, we first show that when using the Adam optimizer, the weight norm increases very fast along the training procedure, in contrast to SGD, where the weight norm increases relatively slowly and tends to converge. The fast increase of the weight norm is adverse to WD; in consequence, the Adam optimizer loses efficacy in finding solutions that generalize well. To resolve this problem, we propose to tailor Adam by introducing a regularization term on the adaptive learning rate, such that it is friendly to WD. Meanwhile, we introduce a first moment on the WD term to further enhance the regularization effect. We show that the proposed method is able to find solutions with small norm that generalize better than those of SGD. We test the proposed method on general image classification and fine-grained image classification tasks with different networks. Experimental results in all these cases substantiate the effectiveness of the proposed method in improving generalization. Specifically, the proposed method improves the test accuracy of Adam by a large margin and even improves the performance of SGD, by 0.84% on CIFAR-10 and 1.03% on CIFAR-100 with ResNet-50. The code for this article is publicly available at xxx.
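The abstract does not give the paper's exact update rule, but the two ingredients it names can be sketched in a generic form: Adam with decoupled weight decay, where the decay term is additionally smoothed by its own first moment (an exponential moving average). The function name `adam_wd_step`, the coefficient `beta_wd`, and the specific way the decay moment enters the update are assumptions made for illustration, not the authors' formulation; the proposed regularization on the adaptive learning rate is likewise omitted here.

```python
import numpy as np

def adam_wd_step(w, g, m, v, m_wd, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2, beta_wd=0.9):
    """One hypothetical optimizer step (illustrative sketch only):
    Adam with decoupled weight decay, where the decay term itself is
    tracked by a first moment (EMA), loosely following the abstract."""
    # standard Adam moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # first moment on the weight-decay term (assumed form)
    m_wd = beta_wd * m_wd + (1 - beta_wd) * wd * w
    # decoupled decay: the smoothed WD term bypasses the adaptive denominator
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + m_wd)
    return w, m, v, m_wd
```

A minimal usage example, minimizing the quadratic f(w) = 0.5·||w||², whose gradient is simply w; the state tensors start at zero and the step counter `t` starts at 1 for bias correction:

```python
w = np.ones(3)
m = v = m_wd = np.zeros(3)
for t in range(1, 101):
    w, m, v, m_wd = adam_wd_step(w, g=w, m=m, v=v, m_wd=m_wd, t=t, lr=0.1)
```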