Parameter-Efficient Adaptation of Large Vision-Language Models for Video Memorability Prediction

Authors

Martín-Fernández Iván, Esteban-Romero Sergio, Fernández-Martínez Fernando, Gil-Martín Manuel

Affiliations

Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid (UPM), 28040 Madrid, Spain.

Publication

Sensors (Basel). 2025 Mar 7;25(6):1661. doi: 10.3390/s25061661.

Abstract

The accurate modelling of video memorability, i.e. the intrinsic properties that make a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more effective at retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of a video and its memorability, underscoring the importance of advanced visual comprehension abilities for model performance. Large Vision-Language Models (LVLMs) have been shown to be exceptionally proficient at generalist, high-level semantic comprehension of images and video, owing to their extensive multimodal pre-training. This work leverages that broad generalist knowledge and explores efficient adaptation techniques with a view to using LVLMs as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. Building on existing research, we propose a methodology that transforms Qwen-VL from a language model into a memorability score regressor. Furthermore, we examine the influence of the choice of LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-Fold Cross-Validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. This work therefore represents a significant advancement in modelling video memorability through high-level semantic understanding.
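
The record does not include code, but the QLoRA adaptation it describes is conventionally expressed with the Hugging Face transformers, peft and bitsandbytes libraries. The following is a minimal sketch under that assumption; the checkpoint name, target module names, and the (rank, alpha) values shown are illustrative placeholders, not the authors' exact configuration.

```python
# Minimal QLoRA sketch (Hugging Face transformers + peft + bitsandbytes).
# Checkpoint, target modules and hyperparameter values are illustrative
# assumptions, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",           # assumed checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,         # Qwen-VL ships custom modelling code
    device_map="auto",
)

# Low-rank adapters on the attention projections; r and lora_alpha are the
# hyperparameters the paper validates with 5-Fold Cross-Validation.
lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed alpha
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],   # assumed module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```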

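Both the 5-Fold Cross-Validation used for hyperparameter selection and the SRCC metric are standard; a self-contained sketch with scipy and scikit-learn is shown below. The train_and_predict() helper is a runnable placeholder standing in for the actual QLoRA fine-tuning step, and the score data is synthetic.

```python
# Sketch of the hyperparameter validation loop: 5-fold cross-validation over
# candidate LoRA (rank, alpha) pairs, scored with Spearman's Rank Correlation
# Coefficient (SRCC). train_and_predict() is a placeholder for fine-tuning
# Qwen-VL with QLoRA on each fold and scoring the held-out fold.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def train_and_predict(train_idx, val_idx, scores, rank, alpha):
    # Placeholder: the real pipeline fine-tunes the adapted model here and
    # predicts memorability scores for the held-out videos.
    rng = np.random.default_rng(seed=rank * 1000 + alpha)
    return scores[val_idx] + rng.normal(0.0, 0.05, size=len(val_idx))

def mean_cv_srcc(scores, rank, alpha, n_splits=5):
    """Mean held-out SRCC for one (rank, alpha) configuration."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    srccs = []
    for train_idx, val_idx in kf.split(scores):
        preds = train_and_predict(train_idx, val_idx, scores, rank, alpha)
        rho, _ = spearmanr(preds, scores[val_idx])  # rank correlation only
        srccs.append(rho)
    return float(np.mean(srccs))

# Synthetic ground-truth memorability scores (illustrative; the paper uses
# the continuous scores annotated in Memento10k).
scores = np.random.default_rng(0).uniform(0.4, 1.0, size=200)
for rank, alpha in [(8, 16), (16, 32), (32, 64)]:
    print(f"r={rank:>2}, alpha={alpha:>2}: "
          f"SRCC={mean_cv_srcc(scores, rank, alpha):.3f}")
```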

Figure 1 (graphical overview): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e414/11944706/4e2c73874c2c/sensors-25-01661-g001.jpg
