Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions.

Affiliations

Peking-Tsinghua Center for Life Science, Peking University, Beijing 100871, China.

Center for Quantitative Biology, Peking University, Beijing 100871, China.

Publication Information

Phys Rev Lett. 2023 Jun 9;130(23):237101. doi: 10.1103/PhysRevLett.130.237101.

Abstract

Generalization is one of the most important problems in deep learning, where there exist many low-loss solutions due to overparametrization. Previous empirical studies showed a strong correlation between flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial in finding the flat solutions. To understand the effects of SGD, we construct a simple model whose overall loss landscape has a continuous set of degenerate (or near-degenerate) minima and the loss landscape for a minibatch is approximated by a random shift of the overall loss function. By direct simulations of the stochastic learning dynamics and solving the underlying Fokker-Planck equation, we show that due to its strong anisotropy the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that the additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. As a result, the flatness of the overall loss landscape increases during learning and reaches a higher value (flatter minimum) for a larger SGD noise strength before the noise strength reaches a critical value when the system fails to converge. These results, which are verified in realistic neural network models, elucidate the role of SGD for generalization, and they may also have important implications for hyperparameter selection for learning efficiently without divergence.
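To make the abstract's setup concrete, here is a minimal, hypothetical NumPy sketch (not the authors' code) of the "random shift" picture: the minibatch loss is modeled as a randomly shifted copy of the full loss, and plain SGD is run on a 1D toy landscape with one sharp and one flat degenerate minimum. All specifics below, including the double-well loss, the curvature coefficients, the learning rate, and the shift scale, are illustrative assumptions rather than the paper's model parameters.

import numpy as np

rng = np.random.default_rng(0)

SHARP = 10.0   # curvature coefficient of the sharp well at w = -1
FLAT = 1.0     # curvature coefficient of the flat well at w = +1

def grad(w):
    # Gradient of the toy full loss L(w) = min(SHARP*(w+1)^2, FLAT*(w-1)^2),
    # i.e. two degenerate quadratic minima of different flatness.
    if SHARP * (w + 1.0) ** 2 < FLAT * (w - 1.0) ** 2:
        return 2.0 * SHARP * (w + 1.0)
    return 2.0 * FLAT * (w - 1.0)

eta = 0.02      # learning rate
delta = 0.5     # batch-to-batch shift scale (sets the SGD noise strength)
steps = 5000
runs = 200

finals = []
for _ in range(runs):
    w = rng.uniform(-2.0, 2.0)                   # random initialization
    for _ in range(steps):
        shift = delta * rng.standard_normal()    # this minibatch's random shift
        w -= eta * grad(w - shift)               # SGD step on the shifted landscape
    finals.append(w)

finals = np.array(finals)
print("fraction of runs ending near the flat minimum (w = +1):",
      np.mean(np.abs(finals - 1.0) < 0.5))

Because the gradient noise inherited from the shift scales with the local curvature, trajectories are kicked out of the sharp well far more easily than out of the flat one, so most runs accumulate at the flat minimum; this is the qualitative effect the abstract describes, with the bias growing with the learning rate and the shift (batch-to-batch) variation.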
