College of Electrical Engineering, Sichuan University, Chengdu 610065, China.
The Center of Psychosomatic Medicine, Sichuan Provincial Center for Mental Health, Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China, Chengdu 611731, China; High-Field Magnetic Resonance Brain Imaging Key Laboratory of Sichuan Province, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
Comput Methods Programs Biomed. 2022 Feb;214:106586. doi: 10.1016/j.cmpb.2021.106586. Epub 2021 Dec 14.
Most studies have used neural activities evoked by linguistic stimuli, such as phrases or sentences, to decode language structure. However, the human brain more commonly perceives the outside world through non-linguistic stimuli such as natural images, so relying only on linguistic stimuli cannot fully capture the information the brain perceives. To address this, an end-to-end mapping model between visual neural activities evoked by non-linguistic stimuli and visual content is needed.
Inspired by the success of the Transformer network in neural machine translation and of the convolutional neural network (CNN) in computer vision, a CNN-Transformer hybrid language decoding model is constructed here in an end-to-end fashion to decode functional magnetic resonance imaging (fMRI) signals evoked by natural images into descriptive texts about the visual stimuli. Specifically, the model first encodes a semantic sequence, extracted by a two-layer 1D CNN from the multi-time visual neural activity, into a multi-level abstract representation, and then decodes this representation, step by step, into an English sentence.
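The architecture described above can be sketched roughly as follows. This is a minimal illustrative implementation in PyTorch, not the authors' code: the voxel count, model width, layer counts, vocabulary size, and kernel sizes are all assumed for demonstration, and the paper's actual hyperparameters and modified positional encoding are not reproduced here.

```python
import torch
import torch.nn as nn

class FMRIToTextDecoder(nn.Module):
    """Hypothetical sketch of a CNN-Transformer hybrid decoder:
    a two-layer 1D CNN extracts a semantic sequence from multi-time
    fMRI activity, and a Transformer decodes it into word tokens.
    All dimensions below are illustrative assumptions."""

    def __init__(self, n_voxels, vocab_size, d_model=128, nhead=4,
                 num_layers=2):
        super().__init__()
        # Two-layer 1D CNN over the time axis of the fMRI sequence
        self.cnn = nn.Sequential(
            nn.Conv1d(n_voxels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, fmri, tgt_tokens):
        # fmri: (batch, n_voxels, n_timepoints)
        feats = self.cnn(fmri).permute(0, 2, 1)    # (batch, T, d_model)
        tgt = self.tok_emb(tgt_tokens)             # (batch, L, d_model)
        # Causal mask so each word attends only to earlier words
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1))
        h = self.transformer(feats, tgt, tgt_mask=mask)
        return self.out(h)                         # (batch, L, vocab_size)
```

At inference time such a model would be run autoregressively, feeding each predicted word back in to generate the sentence step by step.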
Experimental results show that the decoded texts are semantically consistent with the corresponding ground-truth annotations. Additionally, by varying the numbers of encoding and decoding layers and modifying the Transformer's original positional encoding, we found that a specific Transformer architecture is required for this task.
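For reference, the original positional encoding that the authors modify is the sinusoidal scheme of the standard Transformer; the abstract does not specify the modified form, so only this standard baseline is sketched here.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Standard Transformer positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    This is the baseline the paper modifies; the modified variant
    is not given in the abstract and is not reproduced here."""
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(d_model)[None, :]        # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even channels get sine, odd channels get cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```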
The study results indicate that the proposed model can decode visual neural activities evoked by natural images into descriptive text about the visual stimuli in the form of sentences. Hence, it may serve as a potential computer-aided tool for neuroscientists to understand the neural mechanisms of visual information processing in the human brain.