Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions.

Affiliations

Peking-Tsinghua Center for Life Science, Peking University, Beijing 100871, China.

Center for Quantitative Biology, Peking University, Beijing 100871, China.

Publication Information

Phys Rev Lett. 2023 Jun 9;130(23):237101. doi: 10.1103/PhysRevLett.130.237101.

Abstract

Generalization is one of the most important problems in deep learning, where there exist many low-loss solutions due to overparametrization. Previous empirical studies showed a strong correlation between flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial in finding the flat solutions. To understand the effects of SGD, we construct a simple model whose overall loss landscape has a continuous set of degenerate (or near-degenerate) minima and the loss landscape for a minibatch is approximated by a random shift of the overall loss function. By direct simulations of the stochastic learning dynamics and solving the underlying Fokker-Planck equation, we show that due to its strong anisotropy the SGD noise introduces an additional effective loss term that decreases with flatness and has an overall strength that increases with the learning rate and batch-to-batch variation. We find that the additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. As a result, the flatness of the overall loss landscape increases during learning and reaches a higher value (flatter minimum) for a larger SGD noise strength before the noise strength reaches a critical value when the system fails to converge. These results, which are verified in realistic neural network models, elucidate the role of SGD for generalization, and they may also have important implications for hyperparameter selection for learning efficiently without divergence.
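To make the abstract's setup concrete, here is a minimal, hypothetical NumPy sketch (not the authors' code) of the "random shift" picture: the minibatch loss is modeled as a randomly shifted copy of the full loss, and plain SGD is run on a 1D toy landscape with one sharp and one flat degenerate minimum. All specifics below, including the double-well loss, the curvature coefficients, the learning rate, and the shift scale, are illustrative assumptions rather than the paper's model parameters.

import numpy as np

rng = np.random.default_rng(0)

SHARP = 10.0   # curvature coefficient of the sharp well at w = -1
FLAT = 1.0     # curvature coefficient of the flat well at w = +1

def grad(w):
    # Gradient of the toy full loss L(w) = min(SHARP*(w+1)^2, FLAT*(w-1)^2),
    # i.e. two degenerate quadratic minima of different flatness.
    if SHARP * (w + 1.0) ** 2 < FLAT * (w - 1.0) ** 2:
        return 2.0 * SHARP * (w + 1.0)
    return 2.0 * FLAT * (w - 1.0)

eta = 0.02      # learning rate
delta = 0.5     # batch-to-batch shift scale (sets the SGD noise strength)
steps = 5000
runs = 200

finals = []
for _ in range(runs):
    w = rng.uniform(-2.0, 2.0)                   # random initialization
    for _ in range(steps):
        shift = delta * rng.standard_normal()    # this minibatch's random shift
        w -= eta * grad(w - shift)               # SGD step on the shifted landscape
    finals.append(w)

finals = np.array(finals)
print("fraction of runs ending near the flat minimum (w = +1):",
      np.mean(np.abs(finals - 1.0) < 0.5))

Because the gradient noise inherited from the shift scales with the local curvature, trajectories are kicked out of the sharp well far more easily than out of the flat one, so most runs accumulate at the flat minimum; this is the qualitative effect the abstract describes, with the bias growing with the learning rate and the shift (batch-to-batch) variation.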
