Li Xuewei, Li Songyuan, Omar Bourahla, Wu Fei, Li Xi
IEEE Trans Image Process. 2021;30:4735-4746. doi: 10.1109/TIP.2021.3066051. Epub 2021 May 5.
Knowledge distillation, which aims to transfer knowledge from a heavy teacher network to a lightweight student network, has emerged as a promising technique for compressing neural networks. However, due to the capacity gap between the heavy teacher and the lightweight student, a significant performance gap remains between them. In this article, we see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight student, called a res-student. We combine the student and the res-student into a new student, in which the res-student rectifies the errors of the former student. Such a residual-guided process can be repeated until the user strikes the desired balance between accuracy and cost. At inference time, we propose a sample-adaptive strategy that decides, for each sample, which res-students are unnecessary, saving computational cost. Experimental results show that we achieve competitive performance with 18.04%, 23.14%, 53.59%, and 56.86% of the teachers' computational costs on the CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet datasets, respectively. Finally, we provide a thorough theoretical and empirical analysis of our method.
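To make the residual-guided idea concrete, the PyTorch sketch below shows one plausible reading of it under stated assumptions: a small res-student is trained to regress the logit gap between a frozen teacher and a frozen base student, and a margin-based heuristic decides per sample whether the res-student is needed at inference. The network definitions, the MSE residual loss, the `gap_threshold` parameter, and the uncertainty heuristic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of residual-guided distillation; all design choices
# (architectures, loss, threshold heuristic) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_cnn(width: int, num_classes: int = 10) -> nn.Module:
    """Tiny CNN; `width` loosely controls capacity (teacher > student > res-student)."""
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(width, num_classes),
    )


teacher, student, res_student = make_cnn(128), make_cnn(32), make_cnn(16)

# Assume the teacher is pre-trained and the student has already been distilled
# from it; both are frozen while the res-student is trained on their residual.
for m in (teacher, student):
    m.eval()
    for p in m.parameters():
        p.requires_grad_(False)

optimizer = torch.optim.SGD(res_student.parameters(), lr=0.05, momentum=0.9)


def res_student_step(images: torch.Tensor) -> torch.Tensor:
    """One training step: fit the res-student to the teacher-student logit gap."""
    with torch.no_grad():
        residual_target = teacher(images) - student(images)  # the "knowledge gap"
    loss = F.mse_loss(res_student(images), residual_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss


@torch.no_grad()
def predict(images: torch.Tensor, gap_threshold: float = 0.5) -> torch.Tensor:
    """Sample-adaptive inference (illustrative): apply the res-student only to
    samples whose base-student prediction has a small top-2 probability margin."""
    logits = student(images)
    probs = F.softmax(logits, dim=1)
    top2 = probs.topk(2, dim=1).values
    uncertain = (top2[:, 0] - top2[:, 1]) < gap_threshold  # small margin => uncertain
    if uncertain.any():
        logits[uncertain] = logits[uncertain] + res_student(images[uncertain])
    return logits.argmax(dim=1)


if __name__ == "__main__":
    x = torch.randn(8, 3, 32, 32)  # dummy CIFAR-sized batch
    print("residual loss:", res_student_step(x).item())
    print("predictions:", predict(x))
```

In this reading, repeating the residual-guided process would simply add another, even smaller res-student fitted to the gap left by the combined student, and the per-sample gating is what allows cheap samples to skip the extra forward passes.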