The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
Yu Feng, Yuhai Tu
Foundations of AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598.
Department of Physics, Duke University, Durham, NC 27710.
Proc Natl Acad Sci U S A. 2021 Mar 2;118(9):e2015617118. doi: 10.1073/pnas.2015617118.
Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, the opposite of the fluctuation-response relation (also known as the Einstein relation) in equilibrium statistical physics. To understand this inverse variance-flatness relation, we develop a phenomenological theory of SGD based on the statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm: the effective temperature decreases with the landscape flatness, so the system seeks out flat minima over sharp ones. Based on these insights, we develop an algorithm with landscape-dependent constraints that efficiently mitigates catastrophic forgetting when learning multiple tasks sequentially. In general, our work provides a theoretical framework for understanding learning dynamics, which may eventually lead to better algorithms for different learning tasks.
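To make the measurement pipeline concrete, the following is a minimal sketch (not the authors' code) of how one might extract the inverse variance-flatness relation from a recorded SGD weight trajectory: PCA gives the variance sigma_i^2 along each principal direction, and the flatness F_i is read off from the loss profile along that direction. The quadratic loss, the synthetic trajectory, and the factor-of-2 flatness criterion are illustrative assumptions; an inverse relation sigma_i^2 ~ F_i^(-psi) with psi > 0 shows up as a negative slope on a log-log plot.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an SGD weight trajectory: T snapshots of a D-dimensional
# weight vector recorded after the training loss has plateaued. The synthetic
# data are constructed so that directions with sharper loss curvature have
# LARGER weight variance, mimicking the paper's empirical finding.
T, D = 2000, 20
scale = np.linspace(0.1, 1.0, D)               # curvature along axis i ~ scale[i]**2
trajectory = rng.normal(size=(T, D)) * scale   # per-axis std ~ scale[i]

L0 = 0.1                                       # residual training loss at the minimum

def loss_fn(w):
    # Hypothetical quadratic loss; replace with the network's training loss.
    return L0 + 0.5 * np.sum((scale * w) ** 2)

# PCA of the trajectory: eigendecompose the weight covariance matrix.
mean_w = trajectory.mean(axis=0)
centered = trajectory - mean_w
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / T)
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]                     # sigma_i^2 along PCA direction i
directions = eigvecs[:, order]                 # columns are unit PCA directions

def flatness(w0, v, step=1e-2, factor=2.0):
    # Width of the interval along direction v within which the loss stays
    # below `factor` times its value at w0; one simple way to operationalize
    # "flatness" from the loss profile along a PCA direction.
    threshold = factor * loss_fn(w0)
    width = 0.0
    for sign in (-1.0, 1.0):
        s = step
        while loss_fn(w0 + sign * s * v) < threshold:
            s += step
        width += s
    return width

F = np.array([flatness(mean_w, directions[:, i]) for i in range(D)])

# Inverse variance-flatness relation: sigma_i^2 ~ F_i^(-psi) with psi > 0,
# i.e., log-variance vs. log-flatness has a negative slope (psi ~ 2 here,
# by construction of the synthetic data).
psi = -np.polyfit(np.log(F), np.log(variances), 1)[0]
print(f"fitted exponent psi = {psi:.2f}")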
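The final algorithmic idea, constraining later learning according to the landscape around an earlier solution, can be sketched as an anisotropic quadratic penalty. This is a schematic under assumed choices, not the paper's exact method: the penalty form, the lam / F_i^2 stiffness, and the name constrained_grad are all illustrative.

import numpy as np

def constrained_grad(task2_grad, w, w_star, directions, F, lam=1.0):
    # Schematic gradient for task 2 with a landscape-dependent quadratic
    # constraint anchored at the task-1 solution w_star.
    #   task2_grad : gradient of the task-2 loss at the current weights w
    #   directions : (D, D) matrix whose columns are PCA directions from
    #                the task-1 SGD trajectory
    #   F          : per-direction flatness of the task-1 loss landscape
    # Stiffness ~ lam / F_i^2 penalizes motion strongly along sharp
    # (small-F) directions and weakly along flat ones, steering task-2
    # learning into directions that do not degrade task-1 performance.
    disp = directions.T @ (w - w_star)         # displacement in the PCA frame
    stiffness = lam / F ** 2
    return task2_grad + directions @ (stiffness * disp)

# Usage inside a plain SGD loop (sketch):
#   w -= eta * constrained_grad(grad_task2(w), w, w_star, directions, F)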