Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10760-10777. doi: 10.1109/TPAMI.2023.3263181. Epub 2023 Aug 7.

Abstract

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods struggle to generalize to novel categories for which no corresponding neural data are available for training. The two main reasons are 1) the under-exploitation of the multimodal semantic knowledge underlying the neural data and 2) the scarcity of paired (stimulus-response) training data. To overcome these limitations, this paper presents a generic neural decoding method called BraVL that uses multimodal learning of brain-visual-linguistic features. We focus on modeling the relationships between brain, visual and linguistic features via multimodal deep generative models. Specifically, we leverage the mixture-of-product-of-experts formulation to infer a latent code that enables coherent joint generation of all three modalities. To learn a more consistent joint representation and improve data efficiency when brain activity data are limited, we exploit both intra- and inter-modality mutual information maximization regularization terms. In particular, our BraVL model can be trained under various semi-supervised scenarios to incorporate visual and textual features obtained from extra categories. Finally, we construct three trimodal matching datasets, and extensive experiments lead to several interesting conclusions and cognitive insights: 1) decoding novel visual categories from human brain activity is practically possible with good accuracy; 2) decoding models that combine visual and linguistic features perform much better than those using either alone; and 3) visual perception may be accompanied by linguistic influences to represent the semantics of visual stimuli.
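The mechanism the abstract names is the mixture-of-product-of-experts (MoPoE) posterior: each modality's encoder yields a diagonal-Gaussian posterior over a shared latent code, a product of experts fuses any subset of available modalities, and the joint posterior is a uniform mixture over those subsets, so a latent can still be inferred when the brain modality is missing at test time. As a rough illustration only (the function names, the unit-precision prior expert, and the NumPy setting are assumptions made here, not the authors' implementation), a minimal sketch might look like this:

```python
import itertools
import numpy as np

def poe(mus, logvars):
    """Product of diagonal-Gaussian experts, including a N(0, I) prior expert.

    Precision-weighted fusion:  var = 1 / (1 + sum_i 1/var_i),
                                mu  = var * sum_i mu_i / var_i.
    """
    precisions = [np.exp(-lv) for lv in logvars]      # 1 / var_i
    total_prec = 1.0 + sum(precisions)                # leading 1 = unit-precision prior
    var = 1.0 / total_prec
    mu = var * sum(m * p for m, p in zip(mus, precisions))
    return mu, np.log(var)

def mopoe_sample(posteriors, rng=None):
    """Sample a latent from a mixture-of-products-of-experts: a uniform mixture,
    over every non-empty subset of available modalities, of that subset's PoE."""
    rng = rng or np.random.default_rng()
    names = list(posteriors)                          # e.g. ["brain", "visual", "text"]
    subsets = [s for r in range(1, len(names) + 1)
               for s in itertools.combinations(names, r)]
    subset = subsets[rng.integers(len(subsets))]      # uniform mixture weights
    mu, logvar = poe([posteriors[m][0] for m in subset],
                     [posteriors[m][1] for m in subset])
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

# Toy usage: three modality encoders each produce (mu, logvar) for a 32-d latent.
d = 32
posteriors = {m: (np.zeros(d), np.zeros(d)) for m in ("brain", "visual", "text")}
z = mopoe_sample(posteriors)                          # works with any subset present
```

Under this formulation, training with all three modalities mixes evidence from every subset, while at test time passing only the visual and textual posteriors still yields a usable latent, which is what allows decoding of novel categories that have no paired neural data.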

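The abstract also mentions intra- and inter-modality mutual information maximization regularizers without spelling out the estimator. A common, tractable choice for such terms is an InfoNCE-style contrastive lower bound; the sketch below is illustrative only (not BraVL's actual loss), scoring matched feature pairs against in-batch negatives:

```python
import numpy as np

def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE score for paired feature batches (row i of z_a matches row i of z_b).

    Maximizing it pulls matched pairs together and pushes mismatched pairs apart;
    adding log(N) to the returned value gives the usual lower bound on I(a; b).
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)   # cosine-similarity logits
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                       # (N, N), positives on diagonal
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)))
```

An inter-modality term of this kind would be applied between, say, brain and visual latents, while an intra-modality term would tie a modality's latent to its own input features.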
