Guo Zhen, Wang Dong, He Qiang, Zhang Pengzhou
Communication University of China, State Key Laboratory of Media Convergence and Communication, Beijing, 100024, China.
China Unicom Smart City Research Institute, Beijing, 100048, China.
Sci Rep. 2024 Dec 28;14(1):31249. doi: 10.1038/s41598-024-82647-6.
Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily yield larger distillation gains, because of significant gaps in architecture and output distribution between the teacher and smaller student networks. To address this issue, we revisit the teacher's outputs and find that categories on which the teacher is highly confident benefit distillation more, while categories with weaker certainty contribute less. We therefore propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to account for the uncertainty in the teacher model's predictions: a confidence threshold derived from the teacher's predictions is used to construct a mask that down-weights uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions that align the logits of the teacher and student models by measuring the discrepancy between their outputs at the category level and at the sample level. We also introduce adaptive dynamic temperature factors to optimize the distillation process. Combining these techniques enhances knowledge distillation and enables effective knowledge transfer between teacher and student models even when their architectures differ. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
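To make the components described above concrete, the sketch below shows one way a confidence-masked, correlation-based distillation loss could be assembled in PyTorch. It is a minimal, illustrative sketch rather than the authors' implementation: the function names, the fixed threshold and temperature values, and the use of a differentiable Pearson correlation as a stand-in for the paper's Spearman correlation (which would require a soft-ranking operator to remain trainable) are all assumptions, and the adaptive dynamic temperature factors are not modeled here.

```python
import torch
import torch.nn.functional as F

def pearson_corr(x, y, dim=-1, eps=1e-8):
    # Differentiable Pearson correlation along `dim`; assumed here as a
    # trainable surrogate for the Spearman correlation used in the paper.
    xc = x - x.mean(dim=dim, keepdim=True)
    yc = y - y.mean(dim=dim, keepdim=True)
    num = (xc * yc).sum(dim=dim)
    den = xc.norm(dim=dim) * yc.norm(dim=dim) + eps
    return num / den

def lud_loss_sketch(student_logits, teacher_logits, tau=4.0, conf_thresh=0.1):
    """Hypothetical LUD-style loss (names and constants are assumptions).

    - Classes whose teacher softmax probability falls below `conf_thresh`
      are masked out, approximating the uncertainty-aware masking.
    - Two correlation terms align teacher and student logits at the
      sample level (per sample, across classes) and at the category
      level (per class, across the batch).
    """
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)   # (B, C) teacher confidence
    mask = (p_teacher >= conf_thresh).float()             # keep only confident classes

    # Masked entries are zeroed for both models, a simplification that
    # removes them from the relative-order signal.
    s = student_logits * mask
    t = teacher_logits * mask

    # Sample-level alignment: correlate class scores within each sample (rows).
    sample_corr = pearson_corr(s, t, dim=-1).mean()
    # Category-level alignment: correlate each class's scores across the batch (columns).
    category_corr = pearson_corr(s, t, dim=0).mean()

    # Higher correlation -> lower loss.
    return (1.0 - sample_corr) + (1.0 - category_corr)

# Example usage with random logits (batch of 8, 100 classes):
student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
loss = lud_loss_sketch(student, teacher)
loss.backward()
```

The design intent reflected in this sketch is that correlation-style losses penalize mismatches in the relative ordering of logits rather than their absolute values, which is consistent with the abstract's motivation that large teacher-student gaps in output scale should not dominate the transfer signal.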