Meng Lu, Yang Chuanhao
College of Information Science and Engineering, Northeastern University, Shenyang 110819, China.
Bioengineering (Basel). 2023 Sep 24;10(10):1117. doi: 10.3390/bioengineering10101117.
The reconstruction of visual stimuli from fMRI signals, which record brain activity, is a challenging task of significant research value to neuroscience and machine learning. Previous studies tend to emphasize reconstructing either the pixel-level features (contours, colors, etc.) or the semantic features (object category) of the stimulus image, but these properties are rarely reconstructed together. In this context, we introduce a novel three-stage visual reconstruction approach called the Dual-guided Brain Diffusion Model (DBDM). First, we employ the Very Deep Variational Autoencoder (VDVAE) to reconstruct a coarse image from the fMRI data, capturing the underlying detail of the original image. Next, the Bootstrapping Language-Image Pre-training (BLIP) model is used to provide a semantic annotation for each image. Finally, the image-to-image generation pipeline of the Versatile Diffusion (VD) model is used to recover natural images from the fMRI patterns, guided by both visual and semantic information. The experimental results demonstrate that DBDM surpasses previous approaches in both qualitative and quantitative comparisons. In particular, DBDM achieves the best performance in reconstructing the semantic details of the original image, with Inception, CLIP, and SwAV distances of 0.611, 0.225, and 0.405, respectively. This confirms the efficacy of our model and its potential to advance visual decoding research.
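To make the three-stage pipeline concrete, the following is a minimal sketch of the final dual-guidance step only, assuming the stage-1 coarse VDVAE reconstruction and the stage-2 BLIP caption are already available as a local image file and a text string (in the full method both are derived from the fMRI patterns). It uses the Versatile Diffusion dual-guided pipeline from Hugging Face diffusers; the file names, caption text, and guidance strength are illustrative assumptions, not the authors' exact configuration.

import torch
from PIL import Image
from diffusers import VersatileDiffusionDualGuidedPipeline

# Load the dual-guided Versatile Diffusion pipeline, which conditions generation
# on both an image and a text prompt.
pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

# Stage-1 output: coarse image decoded by the VDVAE (placeholder path).
coarse_image = Image.open("vdvae_coarse_reconstruction.png").convert("RGB")
# Stage-2 output: BLIP semantic annotation for the stimulus (example string).
caption = "a brown dog lying on a couch"

# Stage 3: refine the coarse image under joint visual and semantic guidance.
# text_to_image_strength trades off semantic (caption) against visual (coarse image) guidance.
generator = torch.Generator(device="cuda").manual_seed(0)
result = pipe(
    prompt=caption,
    image=coarse_image,
    text_to_image_strength=0.5,
    generator=generator,
).images[0]
result.save("dbdm_reconstruction.png")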