Parameter-Efficient Adaptation of Large Vision-Language Models for Video Memorability Prediction

Authors

Martín-Fernández Iván, Esteban-Romero Sergio, Fernández-Martínez Fernando, Gil-Martín Manuel

Affiliations

Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid (UPM), 28040 Madrid, Spain.

Publication

Sensors (Basel). 2025 Mar 7;25(6):1661. doi: 10.3390/s25061661.

Abstract

The accurate modelling of video memorability, i.e. the intrinsic properties that make a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more effective at retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of a video and its memorability, underscoring the importance of advanced visual comprehension abilities for model performance. Large Vision-Language Models (LVLMs) have been shown to be exceptionally proficient at generalist, high-level semantic comprehension of images and video, owing to their extensive multimodal pre-training. This work leverages that broad generalist knowledge and explores efficient adaptation techniques with a view to using LVLMs as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. Building on existing research, we propose a methodology that transforms Qwen-VL from a language model into a memorability score regressor. Furthermore, we examine the influence of the choice of LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-Fold Cross-Validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. This work therefore represents a significant advancement in modelling video memorability through high-level semantic understanding.
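
The record does not include code, but the QLoRA adaptation it describes is conventionally expressed with the Hugging Face transformers, peft and bitsandbytes libraries. The following is a minimal sketch under that assumption; the checkpoint name, target module names, and the (rank, alpha) values shown are illustrative placeholders, not the authors' exact configuration.

```python
# Minimal QLoRA sketch (Hugging Face transformers + peft + bitsandbytes).
# Checkpoint, target modules and hyperparameter values are illustrative
# assumptions, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",           # assumed checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,         # Qwen-VL ships custom modelling code
    device_map="auto",
)

# Low-rank adapters on the attention projections; r and lora_alpha are the
# hyperparameters the paper validates with 5-Fold Cross-Validation.
lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed alpha
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],   # assumed module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```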

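Both the 5-Fold Cross-Validation used for hyperparameter selection and the SRCC metric are standard; a self-contained sketch with scipy and scikit-learn is shown below. The train_and_predict() helper is a runnable placeholder standing in for the actual QLoRA fine-tuning step, and the score data is synthetic.

```python
# Sketch of the hyperparameter validation loop: 5-fold cross-validation over
# candidate LoRA (rank, alpha) pairs, scored with Spearman's Rank Correlation
# Coefficient (SRCC). train_and_predict() is a placeholder for fine-tuning
# Qwen-VL with QLoRA on each fold and scoring the held-out fold.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def train_and_predict(train_idx, val_idx, scores, rank, alpha):
    # Placeholder: the real pipeline fine-tunes the adapted model here and
    # predicts memorability scores for the held-out videos.
    rng = np.random.default_rng(seed=rank * 1000 + alpha)
    return scores[val_idx] + rng.normal(0.0, 0.05, size=len(val_idx))

def mean_cv_srcc(scores, rank, alpha, n_splits=5):
    """Mean held-out SRCC for one (rank, alpha) configuration."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    srccs = []
    for train_idx, val_idx in kf.split(scores):
        preds = train_and_predict(train_idx, val_idx, scores, rank, alpha)
        rho, _ = spearmanr(preds, scores[val_idx])  # rank correlation only
        srccs.append(rho)
    return float(np.mean(srccs))

# Synthetic ground-truth memorability scores (illustrative; the paper uses
# the continuous scores annotated in Memento10k).
scores = np.random.default_rng(0).uniform(0.4, 1.0, size=200)
for rank, alpha in [(8, 16), (16, 32), (32, 64)]:
    print(f"r={rank:>2}, alpha={alpha:>2}: "
          f"SRCC={mean_cv_srcc(scores, rank, alpha):.3f}")
```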

Figure 1 (graphical overview): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e414/11944706/4e2c73874c2c/sensors-25-01661-g001.jpg
