Lei Yunwen, Tang Ke
IEEE Trans Pattern Anal Mach Intell. 2021 Dec;43(12):4505-4511. doi: 10.1109/TPAMI.2021.3068154. Epub 2021 Nov 3.
Stochastic gradient descent (SGD) has become the method of choice for training highly complex and nonconvex models since it can not only recover good solutions that minimize training errors but also generalize well. In the literature, computational and statistical properties have been studied separately to understand the behavior of SGD. However, studies that jointly consider the computational and statistical properties in a nonconvex learning setting are lacking. In this paper, we develop novel learning rates of SGD for nonconvex learning by presenting high-probability bounds for both computational and statistical errors. We show that the complexity of SGD iterates grows in a controllable manner with respect to the iteration number, which sheds light on how implicit regularization can be achieved by tuning the number of passes to balance the computational and statistical errors. As a byproduct, we also slightly refine existing studies on the uniform convergence of gradients by showing its connection to Rademacher chaos complexities.
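To make the idea of balancing computational and statistical errors by tuning the number of passes concrete, here is a minimal, hypothetical sketch (not the paper's algorithm or bounds): multi-pass SGD on a small nonconvex model, with the pass count treated as the implicit-regularization knob and assessed on held-out data. All names, sizes, and step sizes are illustrative assumptions.

```python
# Hypothetical illustration only: multi-pass SGD on a one-hidden-layer network
# (a nonconvex objective), where the number of passes acts as an implicit
# regularizer selected by held-out error.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data; dimensions and noise level are arbitrary choices.
n, d, h = 200, 10, 16
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d)) + 0.1 * rng.normal(size=n)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def init_params():
    return rng.normal(scale=0.1, size=(d, h)), rng.normal(scale=0.1, size=h)

def predict(W, v, X):
    return np.tanh(X @ W) @ v            # nonconvex in (W, v)

def sgd_passes(num_passes, step=0.05):
    W, v = init_params()
    for _ in range(num_passes):           # one pass = one epoch over the data
        for i in rng.permutation(len(X_tr)):
            x, t = X_tr[i], y_tr[i]
            hdn = np.tanh(x @ W)
            err = hdn @ v - t             # residual of the squared loss
            grad_v = err * hdn
            grad_W = err * np.outer(x, (1 - hdn**2) * v)
            v -= step * grad_v
            W -= step * grad_W
    return W, v

# Tuning the pass count trades off optimization (training) error against
# estimation (validation) error, i.e., computational vs. statistical errors.
for passes in (1, 5, 20, 80):
    W, v = sgd_passes(passes)
    tr = np.mean((predict(W, v, X_tr) - y_tr) ** 2)
    va = np.mean((predict(W, v, X_val) - y_val) ** 2)
    print(f"passes={passes:3d}  train_mse={tr:.3f}  val_mse={va:.3f}")
```

In this sketch, running more passes keeps driving the training error down while the validation error eventually stops improving, which is the early-stopping intuition behind choosing the number of passes as a regularization parameter.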