Lei Yunwen, Hu Ting, Li Guiying, Tang Ke
IEEE Trans Neural Netw Learn Syst. 2020 Oct;31(10):4394-4400. doi: 10.1109/TNNLS.2019.2952219. Epub 2019 Dec 11.
Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural networks and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption of uniform boundedness of gradients over all iterates encountered in the learning process, which is hard to verify in practical implementations. In this article, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates, and by relaxing the standard smoothness assumption to Hölder continuity of gradients. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex and gradient-dominated objective functions. Linear convergence is further derived in the case of zero variance.
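For orientation, the objects named in the abstract admit a compact description. The following is a minimal LaTeX sketch of the standard setting; the notation is chosen here for illustration and is not taken verbatim from the paper, whose exact conditions and constants may differ.

% SGD update with step sizes \eta_t on a population objective F,
% using an unbiased stochastic gradient \nabla f(w_t; z_t):
\[
  w_{t+1} \;=\; w_t - \eta_t \,\nabla f(w_t; z_t),
  \qquad
  \mathbb{E}_{z_t}\big[\nabla f(w_t; z_t)\big] = \nabla F(w_t).
\]
% Hoelder continuity of gradients with exponent \alpha \in (0,1]
% (relaxes the standard smoothness assumption, which is the case \alpha = 1):
\[
  \|\nabla F(w) - \nabla F(w')\| \;\le\; L\,\|w - w'\|^{\alpha}
  \qquad \text{for all } w, w'.
\]
% Gradient domination in its common Polyak-Lojasiewicz form with parameter \mu > 0:
\[
  F(w) - \inf_{w'} F(w') \;\le\; \frac{1}{2\mu}\,\|\nabla F(w)\|^{2}
  \qquad \text{for all } w.
\]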