Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom.
Proc Natl Acad Sci U S A. 2020 Sep 8;117(36):21857-21864. doi: 10.1073/pnas.1919995117. Epub 2020 Aug 25.
The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.
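The abstract's central question, whether simple optimizers such as stochastic gradient descent get trapped in poor local minima, can be probed empirically with a toy experiment. The sketch below is not the authors' methodology (which characterizes the minima and transition states of the loss-function landscape directly); it simply runs plain SGD on a small one-hidden-layer network from many random initializations and compares the final loss values. The architecture, dataset, and hyperparameters are all illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions throughout): run SGD from many
# random starts on a small network in the low-data regime and inspect the
# spread of final loss values.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression dataset: few data points relative to parameters.
n_data, n_in, n_hidden = 20, 5, 10
X = rng.standard_normal((n_data, n_in))
y = np.sin(X @ rng.standard_normal(n_in))  # arbitrary nonlinear target

def loss_and_grad(params, X, y):
    """MSE loss of a one-hidden-layer tanh network and its gradient,
    computed by hand-written backpropagation (no autodiff dependency)."""
    W1, b1, w2, b2 = params
    h = np.tanh(X @ W1 + b1)                   # (n, hidden)
    pred = h @ w2 + b2                         # (n,)
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    d_pred = err / len(y)                      # dL/dpred
    g_w2 = h.T @ d_pred
    g_b2 = d_pred.sum()
    d_h = np.outer(d_pred, w2) * (1 - h ** 2)  # backprop through tanh
    g_W1 = X.T @ d_h
    g_b1 = d_h.sum(axis=0)
    return loss, (g_W1, g_b1, g_w2, g_b2)

def sgd(params, X, y, lr=0.1, epochs=2000, batch=5):
    """Plain minibatch SGD; returns the trained parameters."""
    params = [p.copy() for p in params]
    for _ in range(epochs):
        idx = rng.choice(len(y), size=batch, replace=False)
        _, grads = loss_and_grad(params, X[idx], y[idx])
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

final_losses = []
for seed in range(20):  # many random starts probe distinct minima
    r = np.random.default_rng(seed)
    init = [0.5 * r.standard_normal((n_in, n_hidden)),
            np.zeros(n_hidden),
            0.5 * r.standard_normal(n_hidden),
            np.zeros(1)]
    trained = sgd(init, X, y)
    final_losses.append(loss_and_grad(trained, X, y)[0])

# Many near-equal low final losses would be consistent with the abstract's
# picture of many similar minima separated by low barriers; this toy run
# illustrates the question rather than proving the result.
print(f"final losses: min={min(final_losses):.4f}, "
      f"max={max(final_losses):.4f}")
```

A tight spread of final losses across seeds in such an experiment is the kind of observation the paper explains at the level of landscape structure, namely many low-lying minima connected by low barriers rather than a glassy hierarchy of traps.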