

High-level visual representations in the human brain are aligned with large language models.

Authors

Doerig Adrien, Kietzmann Tim C, Allen Emily, Wu Yihan, Naselaris Thomas, Kay Kendrick, Charest Ian

Affiliations

Department of Psychology and Education, Freie Universität Berlin, Berlin, Germany.

Institute of Cognitive Science, University of Osnabrück, Osnabrück, Germany.

Publication Information

Nat Mach Intell. 2025;7(8):1220-1234. doi: 10.1038/s42256-025-01072-0. Epub 2025 Aug 7.

Abstract

The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterize brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.
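To make the core analysis concrete, below is a minimal illustrative sketch (not the authors' code or data) of a voxel-wise encoding model in the spirit of the abstract: scene captions are embedded with a pretrained sentence-level language model, a ridge regression maps those embeddings to brain responses, and prediction accuracy is scored on held-out scenes. The embedder name, data shapes, placeholder data, and regularization strength are all assumptions for illustration only.

```python
# Illustrative sketch only: map LLM caption embeddings to (placeholder) brain
# responses with ridge regression and score held-out prediction accuracy.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed sentence embedder
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real stimuli and fMRI recordings.
captions = [
    "a man riding a wave on a surfboard",
    "a plate of food with broccoli and rice",
    "two dogs playing with a frisbee in a park",
    "a kitchen with a stove and a refrigerator",
] * 50                                            # pretend we have 200 scene captions
rng = np.random.default_rng(0)
n_voxels = 1000
brain = rng.standard_normal((len(captions), n_voxels))  # placeholder for measured betas

# 1. Embed each scene caption with a pretrained sentence encoder (assumed model name).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(captions)                     # shape: (n_stimuli, embedding_dim)

# 2. Fit a linear (ridge) mapping from caption embeddings to voxel responses.
X_tr, X_te, y_tr, y_te = train_test_split(X, brain, test_size=0.2, random_state=0)
encoder = Ridge(alpha=1.0)
encoder.fit(X_tr, y_tr)

# 3. Evaluate: correlate predicted and observed responses per voxel on held-out scenes.
pred = encoder.predict(X_te)
r = np.array([np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)])
print(f"median held-out voxel correlation: {np.nanmedian(r):.3f}")
```

With the random placeholder data the correlations hover near zero; the point of the sketch is only the pipeline shape (caption embedding, linear mapping, held-out evaluation), not the reported results.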


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcaf/12364710/3405fee21dc9/42256_2025_1072_Fig1_HTML.jpg
