Zheng Yujie, Wang Chong, Tao Chenchen, Lin Sunqi, Qian Jiangbo, Wu Jiafei
IEEE Trans Image Process. 2024;33:5551-5563. doi: 10.1109/TIP.2024.3463421. Epub 2024 Oct 4.
Knowledge distillation aims to achieve model compression by transferring knowledge from complex teacher models to lightweight student models. To reduce reliance on pre-trained teacher models, self-distillation methods use knowledge from the model itself as additional supervision. However, their performance is limited because the teacher and student share the same or a similar network architecture. To increase architectural diversity, we propose a new self-distillation framework called restructured self-distillation (RSD), which restructures both the teacher and student networks. The self-distilled model is expanded into a multi-branch topology to create a more powerful teacher. During training, diverse student sub-networks are generated by randomly discarding the teacher's branches. Additionally, the teacher and student models are linked by a randomly inserted feature mixture block, introducing additional knowledge distillation in the mixed feature space. To avoid extra inference costs, the teacher's branches are then equivalently converted back to the original structure. Comprehensive experiments demonstrate the effectiveness of the proposed framework for most architectures on the CIFAR-10/100 and ImageNet datasets. Code is available at https://github.com/YujieZheng99/RSD.
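The abstract describes two structural operations: expanding a layer into a multi-branch block whose branches can be randomly dropped during training, and equivalently merging the branches back into the original structure for inference. The following is a minimal sketch of these two ideas only (not the authors' released code; see the repository above for the actual RSD implementation). All class and method names, branch counts, and the drop probability are hypothetical illustrations, and the feature mixture block and distillation losses are omitted.

```python
# Minimal sketch: a 3x3 conv expanded into parallel branches that can be
# randomly dropped during training, then folded back into one equivalent
# 3x3 conv for inference. Names and hyperparameters are illustrative.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiBranchConv(nn.Module):
    """A 3x3 conv expanded into parallel 3x3 and 1x1 branches."""

    def __init__(self, channels, num_3x3_branches=2, drop_prob=0.5):
        super().__init__()
        self.drop_prob = drop_prob
        self.branches_3x3 = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1, bias=True)
            for _ in range(num_3x3_branches)
        )
        self.branch_1x1 = nn.Conv2d(channels, channels, 1, bias=True)

    def forward(self, x):
        outputs = []
        for conv in self.branches_3x3:
            # Randomly discard branches during training, emulating the
            # sampling of diverse student sub-networks from the teacher.
            if self.training and random.random() < self.drop_prob:
                continue
            outputs.append(conv(x))
        outputs.append(self.branch_1x1(x))
        return sum(outputs)

    def reparameterize(self):
        """Fold all branches into one equivalent 3x3 conv (no extra inference cost)."""
        fused = nn.Conv2d(self.branch_1x1.in_channels,
                          self.branch_1x1.out_channels, 3, padding=1, bias=True)
        weight = torch.zeros_like(fused.weight)
        bias = torch.zeros_like(fused.bias)
        for conv in self.branches_3x3:
            weight += conv.weight
            bias += conv.bias
        # Pad the 1x1 kernel to 3x3 so it sums into the same weight tensor.
        weight += F.pad(self.branch_1x1.weight, [1, 1, 1, 1])
        bias += self.branch_1x1.bias
        fused.weight.data.copy_(weight)
        fused.bias.data.copy_(bias)
        return fused


if __name__ == "__main__":
    block = MultiBranchConv(channels=8).eval()  # eval: all branches active
    x = torch.randn(1, 8, 16, 16)
    y_multi = block(x)
    y_fused = block.reparameterize()(x)
    print(torch.allclose(y_multi, y_fused, atol=1e-5))  # True: equivalent outputs
```

Because convolution is linear, summing the branch outputs is equivalent to a single convolution whose kernel is the sum of the (zero-padded) branch kernels, which is why the merged model matches the multi-branch teacher at inference time.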