Department of Statistics, Stanford University, Stanford, CA 94305-4065.
School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850.
Proc Natl Acad Sci U S A. 2020 Oct 6;117(40):24652-24663. doi: 10.1073/pnas.2015509117. Epub 2020 Sep 21.
Modern practice for training classification deepnets involves a terminal phase of training (TPT), which begins at the epoch where training error first vanishes. During TPT, the training error stays effectively zero, while training loss is pushed toward zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call neural collapse (NC), involving four deeply interconnected phenomena. (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class means. (NC2) The class means collapse to the vertices of a simplex equiangular tight frame (ETF). (NC3) Up to rescaling, the last-layer classifiers collapse to the class means, or, in other words, to the simplex ETF (i.e., to a self-dual configuration). (NC4) For a given activation, the classifier's decision collapses to simply choosing whichever class has the closest train class mean (i.e., the nearest class center [NCC] decision rule). The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability.
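The geometry named in NC2 and the decision rule named in NC4 can be made concrete with a short numerical sketch. The Python listing below is illustrative only, not code from the paper; the names simplex_etf and ncc_predict are our own. It constructs the K-vertex simplex ETF via the standard formula sqrt(K/(K-1)) (I - (1/K) 11^T), verifies the equal-norm, equal-angle structure (pairwise inner products of -1/(K-1)), and classifies activations near the class means with the nearest-class-center rule.

    import numpy as np

    # Illustrative sketch of NC2 and NC4 (not code from the paper).

    def simplex_etf(K: int) -> np.ndarray:
        """Return a K x K matrix whose columns are unit vectors with
        pairwise inner products -1/(K-1): the simplex ETF vertices."""
        return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

    def ncc_predict(activations: np.ndarray, class_means: np.ndarray) -> np.ndarray:
        """NC4: assign each activation to the class with the nearest mean.
        activations: (n, d) array; class_means: (K, d) array."""
        d2 = ((activations[:, None, :] - class_means[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    if __name__ == "__main__":
        K = 4
        M = simplex_etf(K)
        G = M.T @ M  # Gram matrix: 1 on the diagonal, -1/(K-1) off it
        off_diag = G[~np.eye(K, dtype=bool)]
        assert np.allclose(np.diag(G), 1.0)
        assert np.allclose(off_diag, -1.0 / (K - 1))
        # Small perturbations of the class means are recovered by NCC.
        rng = np.random.default_rng(0)
        feats = M.T + 0.05 * rng.standard_normal((K, K))
        print(ncc_predict(feats, M.T))  # expected: [0 1 2 3]

Under NC1, last-layer activations concentrate at their class means, so (as the perturbation example suggests) the network's own classifier and the NCC rule increasingly agree, which is the content of NC4.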