Goldt Sebastian, Advani Madhu S, Saxe Andrew M, Krzakala Florent, Zdeborová Lenka
Institut de Physique Théorique, CNRS, CEA, Université Paris-Saclay, France.
Center for Brain Science, Harvard University, Cambridge, MA 02138, United States of America.
J Stat Mech. 2020 Dec;2020(12):124010. doi: 10.1088/1742-5468/abc61e. Epub 2020 Dec 21.
Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
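To make the teacher-student setup concrete, here is a minimal numerical sketch, not the authors' code: a two-layer "student" network trained by online SGD on i.i.d. Gaussian inputs labelled by a fixed two-layer "teacher" with fewer hidden units. The input dimension, network widths, learning rate, erf activation, and the 1/N scaling of the second-layer learning rate are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

# Teacher-student setup: i.i.d. Gaussian inputs of dimension N, a fixed teacher
# with M hidden units, and an over-parameterised student with K > M hidden units.
N, M, K = 200, 2, 8

def g(z):
    """Sigmoidal activation (illustrative choice)."""
    return erf(z / np.sqrt(2))

def dg(z):
    """Derivative of g."""
    return np.sqrt(2 / np.pi) * np.exp(-z**2 / 2)

def forward(w, v, X):
    """Two-layer net phi(x) = sum_k v_k g(w_k . x / sqrt(N)), batched over the rows of X."""
    return g(X @ w.T / np.sqrt(N)) @ v

# Teacher weights are fixed; both layers of the student are trained.
w_teacher, v_teacher = rng.standard_normal((M, N)), np.ones(M)
w_student, v_student = rng.standard_normal((K, N)), rng.standard_normal(K) / np.sqrt(K)

def generalisation_error(n_test=20_000):
    """Monte-Carlo estimate of eps_g = 1/2 E[(student(x) - teacher(x))^2]."""
    X = rng.standard_normal((n_test, N))
    return 0.5 * np.mean((forward(w_student, v_student, X) - forward(w_teacher, v_teacher, X)) ** 2)

# Online SGD: each step draws a fresh sample, so no example is ever seen twice.
lr, steps = 0.2, 200_000
for t in range(steps):
    x = rng.standard_normal(N)
    lam = w_student @ x / np.sqrt(N)                                  # student pre-activations
    delta = v_student @ g(lam) - forward(w_teacher, v_teacher, x[None, :])[0]
    # Gradient step on the squared loss 1/2 * delta^2 for both layers; the 1/N scaling of
    # the second-layer rate is an assumption chosen so both layers evolve on comparable time scales.
    w_student -= lr * np.outer(delta * v_student * dg(lam), x) / np.sqrt(N)
    v_student -= lr / N * delta * g(lam)

print(f"final generalisation error (Monte-Carlo estimate): {generalisation_error():.4f}")
```

Freezing v_student at its initial value (so that only the first layer is trained) versus updating both layers mirrors the two training regimes whose size dependence is compared in the abstract; the paper's quantitative statements, however, rest on the differential-equation description rather than on simulations of this kind.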