Luo Gen, Zhou Yiyi, Huang Minglang, Ren Tianhe, Sun Xiaoshuai, Ji Rongrong
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5192-5204. doi: 10.1109/TPAMI.2024.3435790.
Pre-training and fine-tuning have been the de-facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models requires prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of large pre-trained models by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized via a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel PETL method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from an end-to-end network to a two-stage network, and conduct extensive experiments on four VL tasks. Experimental results demonstrate the superior performance and optimization efficiency of MoIL over existing PETL methods. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
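The core idea in the abstract — approximating a fine-tuned weight with a frozen pre-trained weight plus a low-rank update, and minimizing the weight distance between the two — can be sketched numerically. The following is a minimal toy illustration, not the paper's actual MoIL algorithm: the fine-tuned target `W_ft`, the dimensions, the learning rate, and the plain gradient-descent loop are all assumptions for demonstration, and the momentum/hybrid-approximation components of MoIL are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4  # toy layer dimensions and low rank r << min(d, k)

W0 = rng.standard_normal((d, k))                # frozen pre-trained weight
W_ft = W0 + 0.1 * rng.standard_normal((d, k))   # hypothetical fine-tuned target

# LoRA-style low-rank update: delta W is approximated by B @ A.
B = 0.01 * rng.standard_normal((d, r))
A = 0.01 * rng.standard_normal((r, k))

# Weight-distance objective (the abstract's idea in its simplest form):
# minimize 0.5 * ||W_ft - (W0 + B @ A)||_F^2 by gradient descent on B and A.
lr = 0.1
for _ in range(500):
    residual = (W0 + B @ A) - W_ft  # gradient of the squared Frobenius loss
    gB = residual @ A.T
    gA = B.T @ residual
    B -= lr * gB
    A -= lr * gA

err = np.linalg.norm(W_ft - (W0 + B @ A))       # remaining weight distance
init_err = np.linalg.norm(W_ft - W0)            # distance with no adaptation
```

Because the true weight change here is full-rank, a rank-4 factor cannot drive the distance to zero; it converges toward the best rank-r approximation of the change, so `err` ends up strictly below `init_err` but not at zero — which mirrors the approximation-error gap the paper targets.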