Goldt Sebastian, Advani Madhu S, Saxe Andrew M, Krzakala Florent, Zdeborová Lenka
Institut de Physique Théorique, CNRS, CEA, Université Paris-Saclay, France.
Center for Brain Science, Harvard University, Cambridge, MA 02138, United States of America.
J Stat Mech. 2020 Dec;2020(12):124010. doi: 10.1088/1742-5468/abc61e. Epub 2020 Dec 21.
Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
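To make the teacher-student setup concrete, here is a minimal numerical sketch, not the authors' code: a two-layer "student" network trained by online SGD on i.i.d. Gaussian inputs labelled by a fixed two-layer "teacher" with fewer hidden units. The input dimension, network widths, learning rate, erf activation, and the 1/N scaling of the second-layer learning rate are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

# Teacher-student setup: i.i.d. Gaussian inputs of dimension N, a fixed teacher
# with M hidden units, and an over-parameterised student with K > M hidden units.
N, M, K = 200, 2, 8

def g(z):
    """Sigmoidal activation (illustrative choice)."""
    return erf(z / np.sqrt(2))

def dg(z):
    """Derivative of g."""
    return np.sqrt(2 / np.pi) * np.exp(-z**2 / 2)

def forward(w, v, X):
    """Two-layer net phi(x) = sum_k v_k g(w_k . x / sqrt(N)), batched over the rows of X."""
    return g(X @ w.T / np.sqrt(N)) @ v

# Teacher weights are fixed; both layers of the student are trained.
w_teacher, v_teacher = rng.standard_normal((M, N)), np.ones(M)
w_student, v_student = rng.standard_normal((K, N)), rng.standard_normal(K) / np.sqrt(K)

def generalisation_error(n_test=20_000):
    """Monte-Carlo estimate of eps_g = 1/2 E[(student(x) - teacher(x))^2]."""
    X = rng.standard_normal((n_test, N))
    return 0.5 * np.mean((forward(w_student, v_student, X) - forward(w_teacher, v_teacher, X)) ** 2)

# Online SGD: each step draws a fresh sample, so no example is ever seen twice.
lr, steps = 0.2, 200_000
for t in range(steps):
    x = rng.standard_normal(N)
    lam = w_student @ x / np.sqrt(N)                                  # student pre-activations
    delta = v_student @ g(lam) - forward(w_teacher, v_teacher, x[None, :])[0]
    # Gradient step on the squared loss 1/2 * delta^2 for both layers; the 1/N scaling of
    # the second-layer rate is an assumption chosen so both layers evolve on comparable time scales.
    w_student -= lr * np.outer(delta * v_student * dg(lam), x) / np.sqrt(N)
    v_student -= lr / N * delta * g(lam)

print(f"final generalisation error (Monte-Carlo estimate): {generalisation_error():.4f}")
```

Freezing v_student at its initial value (so that only the first layer is trained) versus updating both layers mirrors the two training regimes whose size dependence is compared in the abstract; the paper's quantitative statements, however, rest on the differential-equation description rather than on simulations of this kind.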