Tan Chao, Liu Jie
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2290-2299. doi: 10.1109/TNNLS.2022.3189680. Epub 2024 Feb 5.
Knowledge distillation (KD) is a widely used approach to transfer knowledge from a cumbersome network (also known as the teacher) to a lightweight network (also known as the student). However, even when different teachers achieve similar accuracies, the accuracy of a fixed student distilled from them can differ significantly. We find that teachers whose secondary soft probabilities are more dispersed are better suited to the teaching role. We therefore introduce an indicator, the standard deviation σ of the secondary soft probabilities, for selecting the teacher. Moreover, to make a teacher's secondary soft probabilities more dispersed, we propose a novel method, dubbed pretraining the teacher under dual supervision (PTDS). In addition, we put forward an asymmetrical transformation function (ATF) to further increase the dispersion of the pretrained teacher's secondary soft probabilities. The combination of PTDS and ATF is termed knowledge distillation with a customized teacher (KDCT). Extensive experiments and analyses on three computer vision tasks, namely image classification, transfer learning, and semantic segmentation, substantiate the effectiveness of KDCT.
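To make the selection indicator concrete, below is a minimal sketch of computing σ, assuming "secondary soft probabilities" means the temperature-softened softmax outputs excluding the top-1 (primary) class; the temperature value, batch shapes, and the ranking-by-mean-σ usage are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def secondary_soft_prob_std(logits: torch.Tensor, temperature: float = 4.0) -> torch.Tensor:
    """Per-sample standard deviation of the secondary soft probabilities.

    'Secondary' is assumed here to mean all softened class probabilities
    except the largest (primary) one; temperature=4.0 is a common KD
    default, not a value from the paper.
    """
    probs = F.softmax(logits / temperature, dim=-1)        # softened probabilities
    top1 = probs.argmax(dim=-1, keepdim=True)              # primary class per sample
    mask = torch.ones_like(probs).scatter_(-1, top1, 0.0)  # zero out the primary class
    secondary = probs[mask.bool()].view(probs.size(0), -1) # remaining C-1 probabilities
    return secondary.std(dim=-1)                           # dispersion indicator sigma

# Hypothetical usage: rank candidate teachers by mean sigma over a held-out batch;
# a larger sigma indicates more dispersed secondary soft probabilities.
logits_a = torch.randn(128, 100)  # placeholder outputs of teacher A (batch, classes)
logits_b = torch.randn(128, 100)  # placeholder outputs of teacher B
print(secondary_soft_prob_std(logits_a).mean().item(),
      secondary_soft_prob_std(logits_b).mean().item())
```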