The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
Yu Feng, Yuhai Tu
Foundations of AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598.
Department of Physics, Duke University, Durham, NC 27710.
Proc Natl Acad Sci U S A. 2021 Mar 2;118(9):e2015617118. doi: 10.1073/pnas.2015617118.
Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, the opposite of the fluctuation-response relation (also known as the Einstein relation) in equilibrium statistical physics. To understand this inverse variance-flatness relation, we develop a phenomenological theory of SGD based on the statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm: the effective temperature decreases with the landscape flatness, so the system seeks out flat minima over sharp ones. Based on these insights, we develop an algorithm with landscape-dependent constraints that efficiently mitigates catastrophic forgetting when learning multiple tasks sequentially. In general, our work provides a theoretical framework for understanding learning dynamics, which may eventually lead to better algorithms for different learning tasks.
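To make the measurement pipeline concrete, the following is a minimal sketch (not the authors' code) of how one might extract the inverse variance-flatness relation from a recorded SGD weight trajectory: PCA gives the variance sigma_i^2 along each principal direction, and the flatness F_i is read off from the loss profile along that direction. The quadratic loss, the synthetic trajectory, and the factor-of-2 flatness criterion are illustrative assumptions; an inverse relation sigma_i^2 ~ F_i^(-psi) with psi > 0 shows up as a negative slope on a log-log plot.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an SGD weight trajectory: T snapshots of a D-dimensional
# weight vector recorded after the training loss has plateaued. The synthetic
# data are constructed so that directions with sharper loss curvature have
# LARGER weight variance, mimicking the paper's empirical finding.
T, D = 2000, 20
scale = np.linspace(0.1, 1.0, D)               # curvature along axis i ~ scale[i]**2
trajectory = rng.normal(size=(T, D)) * scale   # per-axis std ~ scale[i]

L0 = 0.1                                       # residual training loss at the minimum

def loss_fn(w):
    # Hypothetical quadratic loss; replace with the network's training loss.
    return L0 + 0.5 * np.sum((scale * w) ** 2)

# PCA of the trajectory: eigendecompose the weight covariance matrix.
mean_w = trajectory.mean(axis=0)
centered = trajectory - mean_w
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / T)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]                     # sigma_i^2 along PCA direction i
directions = eigvecs[:, order]                 # columns are unit PCA directions

def flatness(w0, v, step=1e-2, factor=2.0):
    # Width of the interval along direction v within which the loss stays
    # below `factor` times its value at w0; one simple way to operationalize
    # "flatness" from the loss profile along a PCA direction.
    threshold = factor * loss_fn(w0)
    width = 0.0
    for sign in (-1.0, 1.0):
        s = step
        while loss_fn(w0 + sign * s * v) < threshold:
            s += step
        width += s
    return width

F = np.array([flatness(mean_w, directions[:, i]) for i in range(D)])

# Inverse variance-flatness relation: sigma_i^2 ~ F_i^(-psi) with psi > 0,
# i.e., log-variance vs. log-flatness has a negative slope (psi ~ 2 here,
# by construction of the synthetic data).
psi = -np.polyfit(np.log(F), np.log(variances), 1)[0]
print(f"fitted exponent psi = {psi:.2f}")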
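The final algorithmic idea, constraining later learning according to the landscape around an earlier solution, can be sketched as an anisotropic quadratic penalty. This is a schematic under assumed choices, not the paper's exact method: the penalty form, the lam / F_i^2 stiffness, and the name constrained_grad are all illustrative.

import numpy as np

def constrained_grad(task2_grad, w, w_star, directions, F, lam=1.0):
    # Schematic gradient for task 2 with a landscape-dependent quadratic
    # constraint anchored at the task-1 solution w_star.
    #   task2_grad : gradient of the task-2 loss at the current weights w
    #   directions : (D, D) matrix whose columns are PCA directions from
    #                the task-1 SGD trajectory
    #   F          : per-direction flatness of the task-1 loss landscape
    # Stiffness ~ lam / F_i^2 penalizes motion strongly along sharp
    # (small-F) directions and weakly along flat ones, steering task-2
    # learning into directions that do not degrade task-1 performance.
    disp = directions.T @ (w - w_star)         # displacement in the PCA frame
    stiffness = lam / F ** 2
    return task2_grad + directions @ (stiffness * disp)

# Usage inside a plain SGD loop (sketch):
#   w -= eta * constrained_grad(grad_task2(w), w, w_star, directions, F)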