Tang Jerry, Du Meng, Vo Vy A, Lal Vasudev, Huth Alexander G
UT Austin; Intel Labs; UCLA.
Adv Neural Inf Process Syst. 2023 Dec;36:29654-29666.
Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts in language and vision. In this work, we used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies. We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality, particularly in cortical regions that represent conceptual meaning. Further analysis of these encoding models revealed shared semantic dimensions that underlie concept representations in language and vision. Comparing encoding models trained using representations from multimodal and unimodal transformers, we found that multimodal transformers learn more aligned representations of concepts in language and vision. Our results demonstrate how multimodal transformers can provide insights into the brain's capacity for multimodal processing.
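The cross-modality transfer described above can be illustrated with a linearized encoding analysis: fit a regularized linear map from transformer features to voxel responses in one modality, then score its predictions on the other modality. The sketch below is not the authors' code; the array names, shapes, random data, and the use of scikit-learn ridge regression are illustrative assumptions standing in for pre-extracted multimodal transformer features and recorded fMRI responses.

```python
# Minimal sketch of a cross-modal encoding analysis, assuming pre-extracted,
# temporally aligned feature matrices and fMRI response matrices.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Hypothetical data: rows are fMRI time points (TRs), columns are features or voxels.
n_story_trs, n_movie_trs, n_feats, n_voxels = 1000, 800, 768, 500
X_story = rng.standard_normal((n_story_trs, n_feats))   # transformer features of story stimuli
Y_story = rng.standard_normal((n_story_trs, n_voxels))  # fMRI responses to stories
X_movie = rng.standard_normal((n_movie_trs, n_feats))   # transformer features of movie stimuli
Y_movie = rng.standard_normal((n_movie_trs, n_voxels))  # fMRI responses to movies

# Fit a linearized encoding model on the language modality:
# ridge regression from transformer features to every voxel's response.
model = RidgeCV(alphas=np.logspace(0, 4, 10))
model.fit(X_story, Y_story)

# Cross-modal transfer: predict movie responses with the story-trained model
# and score each voxel by the correlation between predicted and measured responses.
Y_pred = model.predict(X_movie)
voxel_r = np.array([
    np.corrcoef(Y_pred[:, v], Y_movie[:, v])[0, 1] for v in range(n_voxels)
])
print(f"mean cross-modal prediction r: {voxel_r.mean():.3f}")
```

With real stimulus features and brain data, voxels whose cross-modal prediction accuracy is high would correspond to the conceptual, modality-general regions highlighted in the abstract; the random arrays here only demonstrate the analysis pipeline.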