Xu Mengjia, Rangamani Akshay, Liao Qianli, Galanti Tomer, Poggio Tomaso
Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA, USA.
Division of Applied Mathematics, Brown University, Providence, RI, USA.
Research (Wash D C). 2023 Mar 8;6:0024. doi: 10.34133/research.0024. eCollection 2023.
We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit (ReLU) networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer's weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is ρ. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank weight matrices, which should improve generalization. The same analysis predicts the existence of an inherent stochastic gradient descent noise for deep networks. In both cases, we verify our predictions experimentally. We then predict neural collapse and its properties without any specific assumption, unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for problems that are appropriate for sparse deep architectures such as convolutional neural networks. The reason is that compositionally sparse target functions can be approximated well by "sparse" deep networks without incurring the curse of dimensionality.
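To make the role of ρ concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; shapes and names are arbitrary) shows the homogeneity property behind it: a bias-free ReLU network's output equals ρ times the output of the same network after every weight matrix is rescaled to unit Frobenius norm, where ρ is the product of the per-layer Frobenius norms.

    import numpy as np

    # A bias-free ReLU network is positively homogeneous in each layer's weights,
    # so its output can be written as rho * f_bar(x), where rho is the product of
    # the per-layer Frobenius norms and f_bar is the same network with every
    # weight matrix normalized to unit Frobenius norm.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((16, 8)),   # layer 1: 8 -> 16
               rng.standard_normal((8, 16)),   # layer 2: 16 -> 8
               rng.standard_normal((1, 8))]    # output layer: 8 -> 1

    def relu_net(ws, x):
        h = x
        for W in ws[:-1]:
            h = np.maximum(W @ h, 0.0)         # ReLU hidden layers, no biases
        return ws[-1] @ h                      # linear output layer

    x = rng.standard_normal(8)
    rho = np.prod([np.linalg.norm(W) for W in weights])     # product of Frobenius norms
    normalized = [W / np.linalg.norm(W) for W in weights]   # unit-norm layers

    print(relu_net(weights, x))
    print(rho * relu_net(normalized, x))       # identical up to floating-point error

The two printed values agree because ReLU is positively homogeneous and there are no bias terms, which is why ρ alone controls the scale of the network and appears in the norm-based generalization bounds mentioned above.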
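The predicted bias toward low-rank weight matrices can be probed with a simple experiment along the lines of the PyTorch sketch below. This is a hypothetical setup: the architecture, data, and hyperparameters are illustrative and are not those used in the paper; only the recipe (square loss, weight decay, then inspecting the singular-value spectrum of a hidden weight matrix) reflects the prediction being tested.

    import torch
    import torch.nn as nn

    # Illustrative check of the low-rank prediction: train a small bias-free ReLU
    # network with the square loss and weight decay, then look at how concentrated
    # the singular values of a hidden-layer weight matrix are.
    torch.manual_seed(0)
    X = torch.randn(512, 20)
    y = torch.sign(X[:, :1])                   # simple +/-1 target for the sketch

    model = nn.Sequential(nn.Linear(20, 64, bias=False),
                          nn.ReLU(),
                          nn.Linear(64, 64, bias=False),
                          nn.ReLU(),
                          nn.Linear(64, 1, bias=False))
    opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-3)

    for step in range(2000):                   # full-batch gradient steps for simplicity
        opt.zero_grad()
        loss = ((model(X) - y) ** 2).mean()    # square loss, as in the paper
        loss.backward()
        opt.step()

    W = model[2].weight.detach()               # hidden 64x64 weight matrix
    s = torch.linalg.svdvals(W)                # singular values, descending
    print("fraction of spectrum in top 5 singular values:", (s[:5].sum() / s.sum()).item())

A spectrum dominated by a few singular values indicates an approximately low-rank weight matrix; repeating the run without weight decay gives a baseline for comparison.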