Asanuma Haruka, Koide-Majima Naoko, Nakamura Ken, Horii Takato, Nishimoto Shinji, Oizumi Masafumi
Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, 153-8902, Japan.
Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology, Osaka, 565-0871, Japan.
Sci Rep. 2025 Sep 1;15(1):32175. doi: 10.1038/s41598-025-14961-6.
Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Fully capturing this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) captures these high-dimensional, intricate emotion structures, including their capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., from Gemini or GPT). We evaluated performance not only at the individual-video level but also at the level of emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models holds at the single-item level or only at the coarse-category level, we applied Gromov-Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the models can infer human emotional experiences at the coarse-category level. Our results suggest that current state-of-the-art MLLMs broadly capture complex, high-dimensional emotion structures at the coarse-category level, while also revealing their limitations in accurately capturing entire structures at the single-item level.
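As a rough illustration of the two levels of analysis described in the abstract, the Python sketch below compares a human and a model emotion-rating matrix: first by correlating their dissimilarity structures, then by aligning them with Gromov-Wasserstein Optimal Transport via the POT library. All data, category labels, and hyperparameters here are placeholders chosen for illustration; the paper's actual pipeline, models, and settings may differ.

# Minimal sketch of structure correlation and GWOT alignment.
# Rating matrices are random placeholders, not the paper's data.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
n_videos, n_emotions = 50, 30

# Placeholder ratings: rows = videos, columns = emotion dimensions.
human_ratings = rng.random((n_videos, n_emotions))
model_ratings = rng.random((n_videos, n_emotions))

# Within-domain dissimilarity matrices, one per rater type.
C_human = ot.dist(human_ratings, metric="euclidean")
C_model = ot.dist(model_ratings, metric="euclidean")

# Coarse similarity: Pearson correlation between the two dissimilarity
# structures (upper triangles), an RSA-style comparison.
iu = np.triu_indices(n_videos, k=1)
r = np.corrcoef(C_human[iu], C_model[iu])[0, 1]
print(f"structure correlation: {r:.2f}")

# GWOT aligns the two structures using only their internal distances,
# without assuming any video-to-video correspondence a priori.
p = ot.unif(n_videos)
q = ot.unif(n_videos)
T = ot.gromov.gromov_wasserstein(C_human, C_model, p, q,
                                 loss_fun="square_loss")

# Strict single-item matching: fraction of videos whose transported
# mass concentrates on the identical video in the other structure.
top1 = (T.argmax(axis=1) == np.arange(n_videos)).mean()
print(f"single-item (top-1) matching rate: {top1:.2f}")

# Coarse-category matching: mass lands on a video from the same
# (placeholder) emotion category, even if not the identical video.
categories = rng.integers(0, 5, size=n_videos)
cat_match = (categories[T.argmax(axis=1)] == categories).mean()
print(f"coarse-category matching rate: {cat_match:.2f}")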