IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4388-4403. doi: 10.1109/TPAMI.2021.3067100. Epub 2022 Jul 1.
Deep neural networks have achieved remarkable results in the last several years. However, breakthroughs in neural network accuracy are always accompanied by an explosive growth in computation and parameters, which severely limits model deployment. In this paper, we propose a novel knowledge distillation technique named self-distillation to address this problem. Self-distillation attaches several attention modules and shallow classifiers at different depths of a neural network and distills knowledge from the deepest classifier to the shallower classifiers. Unlike conventional knowledge distillation methods, where the knowledge of a teacher model is transferred to a separate student model, self-distillation can be regarded as knowledge transfer within the same model, from the deeper layers to the shallower ones. Moreover, the additional classifiers in self-distillation allow the neural network to work in a dynamic manner, which leads to much higher acceleration. Experiments demonstrate that self-distillation is consistently and significantly effective across various neural networks and datasets. On average, accuracy boosts of 3.49 and 2.32 percent are observed on CIFAR100 and ImageNet, respectively. In addition, experiments show that self-distillation can be combined with other model compression methods, including knowledge distillation, pruning, and lightweight model design.
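The sketch below illustrates the general idea described in the abstract, not the authors' implementation: shallow classifiers are attached after intermediate stages of a backbone, and each shallow classifier is trained with both the ground-truth labels and the softened outputs of the deepest classifier. The network structure, the names (SelfDistilNet, self_distillation_loss), and the loss weights alpha and temperature are illustrative assumptions.

```python
# Minimal sketch of self-distillation training (assumed PyTorch formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfDistilNet(nn.Module):
    """Backbone split into stages, with a shallow classifier attached after each stage."""

    def __init__(self, num_classes=100):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Shallow (early) classifiers and the deepest classifier.
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.exit2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        self.exit3 = nn.Sequential(nn.Flatten(), nn.Linear(128, num_classes))

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Logits of every classifier, shallowest first; the last entry is the deepest classifier.
        return [self.exit1(f1), self.exit2(f2), self.exit3(f3)]


def self_distillation_loss(logits_list, targets, alpha=0.5, temperature=3.0):
    """Cross-entropy at every exit, plus KL distillation from the deepest exit to the shallow ones."""
    deepest = logits_list[-1]
    loss = F.cross_entropy(deepest, targets)
    # The deepest classifier acts as the teacher; detach so its gradients come only from its own CE term.
    soft_teacher = F.softmax(deepest.detach() / temperature, dim=1)
    for logits in logits_list[:-1]:
        ce = F.cross_entropy(logits, targets)
        kd = F.kl_div(F.log_softmax(logits / temperature, dim=1), soft_teacher,
                      reduction="batchmean") * temperature ** 2
        loss = loss + (1 - alpha) * ce + alpha * kd
    return loss


if __name__ == "__main__":
    model = SelfDistilNet(num_classes=100)
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 100, (8,))
    loss = self_distillation_loss(model(x), y)
    loss.backward()
    print(loss.item())
```

Because every stage carries its own classifier, the same trained model can also be run in a dynamic, early-exit fashion at inference time: a sample may stop at a shallow classifier once its prediction is confident enough, which is the source of the acceleration mentioned in the abstract.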