Luo Gen, Zhou Yiyi, Huang Minglang, Ren Tianhe, Sun Xiaoshuai, Ji Rongrong
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5192-5204. doi: 10.1109/TPAMI.2024.3435790.
Pre-training and fine-tuning have been the de-facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models requires prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of large pre-trained models by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized via a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel PETL method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from an end-to-end network to a two-stage network, and conduct extensive experiments on four VL tasks. Experimental results demonstrate the superior performance and optimization efficiency of MoIL over existing PETL methods. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
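The core idea in the abstract — approximating a fine-tuned weight with a frozen pre-trained weight plus a low-rank update, and minimizing the weight distance between the two — can be sketched numerically. The following is a minimal toy illustration, not the paper's actual MoIL algorithm: the fine-tuned target `W_ft`, the dimensions, the learning rate, and the plain gradient-descent loop are all assumptions for demonstration, and the momentum/hybrid-approximation components of MoIL are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4  # toy layer dimensions and low rank r << min(d, k)

W0 = rng.standard_normal((d, k))                # frozen pre-trained weight
W_ft = W0 + 0.1 * rng.standard_normal((d, k))   # hypothetical fine-tuned target

# LoRA-style low-rank update: delta W is approximated by B @ A.
B = 0.01 * rng.standard_normal((d, r))
A = 0.01 * rng.standard_normal((r, k))

# Weight-distance objective (the abstract's idea in its simplest form):
# minimize 0.5 * ||W_ft - (W0 + B @ A)||_F^2 by gradient descent on B and A.
lr = 0.1
for _ in range(500):
    residual = (W0 + B @ A) - W_ft  # gradient of the squared Frobenius loss
    gB = residual @ A.T
    gA = B.T @ residual
    B -= lr * gB
    A -= lr * gA

err = np.linalg.norm(W_ft - (W0 + B @ A))       # remaining weight distance
init_err = np.linalg.norm(W_ft - W0)            # distance with no adaptation
```

Because the true weight change here is full-rank, a rank-4 factor cannot drive the distance to zero; it converges toward the best rank-r approximation of the change, so `err` ends up strictly below `init_err` but not at zero — which mirrors the approximation-error gap the paper targets.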