School of Electrical Engineering and Computer Science, The Pennsylvania State University, State College, PA, 16802, USA.
Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA.
Neural Netw. 2022 Aug;152:499-509. doi: 10.1016/j.neunet.2022.05.016. Epub 2022 May 21.
Large neural networks usually perform well on machine learning tasks. However, models that achieve state-of-the-art performance involve an arbitrarily large number of parameters, which makes their training very expensive. It is therefore desirable to have methods with small per-iteration costs, fast convergence rates, and reduced tuning. This paper proposes a multivariate adaptive gradient descent method that meets these attributes. The proposed method updates every element of the model parameters separately, in a computationally efficient manner, using an adaptive vector-form learning rate, resulting in low per-iteration cost. The adaptive learning rate is computed as the element-wise absolute difference between the current and previous model parameters divided by the difference between the subgradients at the current and previous state estimates. In the deterministic setting, we show that the cost function value converges at a linear rate for smooth and strongly convex cost functions. In both the deterministic and stochastic settings, we show that the gradient converges in expectation at the rate O(1/k) for non-convex cost functions with Lipschitz continuous gradients. In addition, we show that after T iterations, the cost function value at the last iterate scales as O(log(T)/T) for non-smooth strongly convex cost functions. The effectiveness of the proposed method is validated on convex functions, a smooth non-convex function, a non-smooth convex function, and four image classification data sets, showing that its execution requires hardly any tuning, unlike popular existing optimizers that entail relatively large tuning efforts. Our empirical results show that the proposed algorithm provides the best overall performance when compared with tuned state-of-the-art optimizers.
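The abstract describes the adaptive learning rate only informally: each coordinate's step size is the absolute change in that parameter divided by the change in its (sub)gradient between consecutive iterates. A minimal NumPy sketch of such an element-wise update is given below. The exact formula, safeguards, and initialization used in the paper are not reproduced here; the function name `adaptive_vector_step`, the `eps` safeguard, and the seeding step are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def adaptive_vector_step(x, x_prev, g, g_prev, eps=1e-8):
    # Element-wise learning rate: |x_i - x_prev_i| / |g_i - g_prev_i|,
    # following the abstract's description. eps (an assumption) guards
    # against division by zero when consecutive gradients coincide.
    lr = np.abs(x - x_prev) / (np.abs(g - g_prev) + eps)
    # Vector-form update: each coordinate uses its own step size.
    return x - lr * g

# Toy demo: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
rng = np.random.default_rng(0)
x_prev = rng.standard_normal(5)
g_prev = x_prev                      # gradient at x_prev
x = x_prev - 0.1 * g_prev            # one plain gradient step to seed history
for _ in range(50):
    g = x                            # gradient at the current iterate
    x, x_prev, g_prev = adaptive_vector_step(x, x_prev, g, g_prev), x, g
```

On this quadratic the per-coordinate ratio recovers (the inverse of) the local curvature, so the iterates collapse toward the minimizer within a few steps without any hand-tuned global learning rate, which is the behavior the abstract claims for the method.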