Ding Fei, Yang Yin, Hu Hongxin, Krovi Venkat, Luo Feng
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2425-2435. doi: 10.1109/TNNLS.2022.3190166. Epub 2024 Feb 5.
Knowledge distillation (KD) has become a widely used technique for model compression and knowledge transfer. We find that the standard KD method performs knowledge alignment on individual samples only indirectly, via class prototypes, and neglects the structural knowledge between different samples, namely, knowledge correlation. Although recent contrastive learning-based distillation methods can be decomposed into knowledge alignment and correlation, their correlation objectives undesirably push apart representations of samples from the same class, leading to inferior distillation results. To improve distillation performance, in this work, we propose a novel knowledge correlation objective and introduce dual-level knowledge distillation (DLKD), which explicitly combines knowledge alignment and correlation instead of relying on a single contrastive objective. We show that both knowledge alignment and correlation are necessary to improve distillation performance. In particular, knowledge correlation can serve as an effective regularizer for learning generalized representations. The proposed DLKD is task-agnostic and model-agnostic, and enables effective knowledge transfer from supervised or self-supervised pretrained teachers to students. Experiments show that DLKD outperforms other state-of-the-art methods across a wide range of experimental settings, including: 1) pretraining strategies; 2) network architectures; 3) datasets; and 4) tasks.
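Since the abstract does not spell out the exact loss functions, the following is a minimal sketch of how a combined alignment-plus-correlation objective could look in PyTorch. The function name, the cosine-based alignment term, and the similarity-matrix correlation term are illustrative assumptions, not the paper's published formulation.

```python
# Hypothetical sketch of a dual-level distillation loss (not the paper's exact method).
# "Alignment" matches each student embedding to its teacher counterpart;
# "correlation" matches the pairwise similarity structure of the batch between
# teacher and student, so same-class samples are not pushed apart as they can
# be under a purely contrastive objective.
import torch
import torch.nn.functional as F

def dual_level_kd_loss(student_feats, teacher_feats, beta=1.0):
    """student_feats, teacher_feats: (batch, dim) embeddings; teacher is detached."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1).detach()

    # Knowledge alignment: per-sample agreement with the teacher representation.
    align = (1.0 - (s * t).sum(dim=1)).mean()

    # Knowledge correlation: preserve the teacher's inter-sample similarity structure.
    corr_s = s @ s.t()
    corr_t = t @ t.t()
    correlation = F.mse_loss(corr_s, corr_t)

    return align + beta * correlation

# Usage (illustrative): combine with the student's task loss, e.g.
# loss = F.cross_entropy(student_logits, labels) + dual_level_kd_loss(fs, ft)
```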