Tian Zhuotao, Chen Pengguang, Lai Xin, Jiang Li, Liu Shu, Zhao Hengshuang, Yu Bei, Yang Ming-Chang, Jia Jiaya
IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1372-1387. doi: 10.1109/TPAMI.2022.3159581. Epub 2023 Jan 6.
Strong semantic segmentation models require large backbones to achieve promising performance, making them hard to adapt to real applications where efficient real-time algorithms are needed. Knowledge distillation tackles this issue by letting a smaller model (the student) produce pixel-wise predictions similar to those of a larger model (the teacher). However, the classifier, which can be viewed as the perspective through which a model perceives the encoded features to yield observations (i.e., predictions), is shared by all training samples and fits a universal feature distribution. Since, under limited capacity, good generalization to the entire distribution can come at the cost of specialization to individual samples, the shared universal perspective often overlooks details present in each sample, degrading knowledge distillation. In this paper, we propose Adaptive Perspective Distillation (APD), which creates an adaptive local perspective for each individual training sample. It extracts detailed contextual information from each training sample specifically, mining more details from the teacher and thus achieving better distillation results on the student. APD imposes no structural constraints on either the teacher or the student model, and therefore generalizes well across different semantic segmentation models. Extensive experiments on Cityscapes, ADE20K, and PASCAL-Context demonstrate the effectiveness of the proposed APD. Moreover, APD yields favorable performance gains for models in both object detection and instance segmentation without bells and whistles.
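The pixel-wise distillation objective the abstract builds on, where the student matches the teacher's softened per-pixel class distributions, can be sketched as below. This is a minimal NumPy illustration of standard knowledge distillation for segmentation, not of APD's adaptive per-sample perspective; the function names, the temperature value, and the (H, W, C) logit layout are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_kd_loss(student_logits, teacher_logits, T=4.0):
    """Average per-pixel KL(teacher || student) on temperature-softened
    distributions. Logits are arrays of shape (H, W, num_classes)."""
    p = softmax(teacher_logits / T)                    # teacher soft targets
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits / T) + 1e-12)
    kl = (p * (log_p - log_q)).sum(axis=-1)            # KL at each pixel
    return (T * T) * kl.mean()                         # usual T^2 rescaling
```

A student perfectly matching the teacher gives zero loss; any mismatch gives a positive value, which gradient descent on the student's logits reduces.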