Jiang Yiming, Liu Jinlan, Xu Dongpo, Mandic Danilo P
Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China
Department of Mathematics, Changchun Normal University, Changchun 130032, China
Neural Comput. 2024 Aug 19;36(9):1912-1938. doi: 10.1162/neco_a_01692.
Adam-type algorithms have become a preferred choice for optimization in the deep learning setting; however, despite their success, their convergence is still not well understood. To this end, we introduce a unified framework for Adam-type algorithms, termed UAdam. It is equipped with a general form of the second-order moment, which makes it possible to include Adam and its existing and future variants, such as NAdam, AMSGrad, AdaBound, AdaFom, and Adan, as special cases. The approach is supported by a rigorous convergence analysis of UAdam in the general nonconvex stochastic setting, showing that UAdam converges to a neighborhood of stationary points at a rate of O(1/T). Furthermore, the size of the neighborhood decreases as the parameter β1 increases. Importantly, our analysis requires only that the first-order momentum factor be close enough to 1, without any restriction on the second-order momentum factor. The theoretical results also reveal the convergence conditions of vanilla Adam, together with the selection of appropriate hyperparameters. This provides a theoretical guarantee for the analysis, application, and further development of the whole general class of Adam-type algorithms. Finally, several numerical experiments are provided to support our theoretical findings.
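The unified update behind UAdam can be sketched in a few lines. The following Python/NumPy snippet is a minimal illustration under our own assumptions, not the authors' implementation: the helper name uadam_step and the second-order-moment function psi are hypothetical, and the concrete choices below merely indicate how vanilla Adam and an AMSGrad-style running maximum fit the same template.

import numpy as np

def uadam_step(x, m, v, grad, psi, lr=1e-3, beta1=0.9, eps=1e-8):
    """One generic Adam-type step (illustrative sketch): first-order
    momentum plus a user-supplied second-order-moment update psi."""
    m = beta1 * m + (1.0 - beta1) * grad      # first-order momentum
    v = psi(v, grad)                          # general second-order moment
    x = x - lr * m / (np.sqrt(v) + eps)       # parameter update
    return x, m, v

# Different choices of psi recover familiar variants (simplified):
adam_psi = lambda v, g, beta2=0.999: beta2 * v + (1.0 - beta2) * g**2
amsgrad_psi = lambda v, g, beta2=0.999: np.maximum(v, beta2 * v + (1.0 - beta2) * g**2)  # keeps only the running maximum

# Toy usage on f(x) = ||x||^2 / 2, whose gradient is x itself:
x, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for _ in range(1000):
    x, m, v = uadam_step(x, m, v, grad=x, psi=adam_psi)

Bias correction and decoupled weight decay are omitted to keep the template minimal; the point is only that the second-order moment enters as an interchangeable component.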