

Layerwised multimodal knowledge distillation for vision-language pretrained model.

Affiliations

School of Information Science and Engineering, Yunnan University, Kunming, China.

Publication information

Neural Netw. 2024 Jul;175:106272. doi: 10.1016/j.neunet.2024.106272. Epub 2024 Mar 26.

Abstract

Transformer-based models can learn representations for images and text simultaneously, delivering excellent performance on multimodal applications. In practice, however, their large number of parameters can hinder deployment on resource-constrained devices, creating a need for model compression. To this end, recent studies use knowledge distillation to transfer knowledge from a large trained teacher model to a small student model with little loss in performance. However, these methods train the student's parameters using only the teacher's last layer, which makes the student prone to overfitting during distillation. Furthermore, mutual interference between modalities makes distillation more difficult. To address these issues, this study proposes layerwised multimodal knowledge distillation for a vision-language pretrained model. In addition to the last layer, the teacher's intermediate layers are also used for knowledge transfer. To avoid interference between modalities, the multimodal input is split into separate modalities that are added as extra inputs, and two auxiliary losses encourage each modality to distill more effectively. Comparative experiments on four multimodal tasks show that the proposed layerwised multimodal distillation achieves better performance than other KD methods for vision-language pretrained models.
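The sketch below illustrates how such a layerwise, modality-split distillation objective could be assembled. It is not the authors' implementation: the teacher/student interface (models exposing hidden_states and logits), the layer mapping, the concrete loss forms (MSE on hidden states, temperature-scaled KL on logits), and the weights alpha/beta are all illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch of layerwised multimodal distillation (PyTorch-style).
# All names (teacher, student, layer_map, alpha, beta) are illustrative.
import torch
import torch.nn.functional as F


def layerwise_kd_loss(teacher_hidden, student_hidden, layer_map):
    """Match selected student layers to teacher layers with an MSE loss.

    teacher_hidden / student_hidden: lists of [batch, seq, dim] tensors,
    one per transformer layer (last element = last layer).
    layer_map: dict {student_layer_idx: teacher_layer_idx}.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_hidden[s_idx],
                                 teacher_hidden[t_idx].detach())
    return loss


def modality_aux_loss(student_logits, teacher_logits, T=2.0):
    """Auxiliary loss for one modality fed on its own as an extra input
    (temperature-scaled KL divergence; the exact form is an assumption)."""
    p_t = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)


def total_distillation_loss(batch, teacher, student, layer_map,
                            alpha=1.0, beta=0.5):
    # Joint multimodal forward pass: image and text together.
    t_out = teacher(batch["image"], batch["text"], output_hidden_states=True)
    s_out = student(batch["image"], batch["text"], output_hidden_states=True)

    # Layerwise distillation on intermediate layers plus the last layer.
    kd = layerwise_kd_loss(t_out.hidden_states, s_out.hidden_states, layer_map)

    # Split the modalities and feed each one alone as an extra input,
    # then apply one auxiliary loss per modality.
    t_img = teacher(batch["image"], None)
    s_img = student(batch["image"], None)
    t_txt = teacher(None, batch["text"])
    s_txt = student(None, batch["text"])
    aux = (modality_aux_loss(s_img.logits, t_img.logits)
           + modality_aux_loss(s_txt.logits, t_txt.logits))

    return alpha * kd + beta * aux
```

In this reading, the layerwise term counters the overfitting attributed to last-layer-only distillation, while the two unimodal auxiliary terms address cross-modal interference by letting each modality be distilled in isolation.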

