
Correspondence of high dimensional emotion structures elicited from video clips between humans and multimodal LLMs.

Author Information

Asanuma Haruka, Koide-Majima Naoko, Nakamura Ken, Horii Takato, Nishimoto Shinji, Oizumi Masafumi

Affiliations

Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, 153-8902, Japan.

Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology, Osaka, 565-0871, Japan.

Publication Information

Sci Rep. 2025 Sep 1;15(1):32175. doi: 10.1038/s41598-025-14961-6.

Abstract

Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Fully capturing this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) captures these high-dimensional, intricate emotion structures, assessing both their capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with estimates generated by models such as Gemini and GPT. We evaluated performance not only at the individual video level but also at the level of emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models holds at the single-item level or only at the coarse-category level, we applied Gromov-Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the model could infer human emotional experiences at the coarse-category level. Our results suggest that current state-of-the-art MLLMs broadly capture complex, high-dimensional emotion structures at the coarse-category level, while also revealing their limitations in accurately capturing the entire structure at the single-item level.
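The abstract's first analysis, correlating human and model emotion structures, can be sketched as follows. This is a minimal RSA-style illustration with synthetic data; the array shapes, the Euclidean dissimilarity metric, and the noise model are assumptions for demonstration, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: per-video emotion ratings for humans, and a noisy
# "model estimate" standing in for MLLM-generated ratings.
n_videos, n_emotions = 6, 4
human = rng.random((n_videos, n_emotions))
model = human + 0.1 * rng.standard_normal(human.shape)

def dissimilarity(ratings):
    """Pairwise Euclidean distances between the videos' rating vectors."""
    diff = ratings[:, None, :] - ratings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_human = dissimilarity(human)
d_model = dissimilarity(model)

# Correlate the upper triangles of the two dissimilarity matrices:
# a simple structure-level similarity measure between human and model.
iu = np.triu_indices(n_videos, k=1)
r = np.corrcoef(d_human[iu], d_model[iu])[0, 1]
print(f"structure correlation: {r:.2f}")
```

The paper's stricter test, Gromov-Wasserstein Optimal Transport, goes further by searching for an optimal alignment between the two structures without assuming which video corresponds to which; that step would require an OT solver and is not reproduced here.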


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/df0c/12402258/f3c77b9678f0/41598_2025_14961_Fig1_HTML.jpg
