Shirakawa Ken, Nagano Yoshihiro, Tanaka Misato, Aoki Shuntaro C, Muraki Yusuke, Majima Kei, Kamitani Yukiyasu
Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, 606-8501, Japan; Computational Neuroscience Laboratories, Advanced Telecommunications Research Institute International, Seika, Soraku, Kyoto, 619-0288, Japan.
Neural Netw. 2025 Oct;190:107515. doi: 10.1016/j.neunet.2025.107515. Epub 2025 May 27.
Advances in brain decoding, particularly in visual image reconstruction, have sparked discussions about the societal implications and ethical considerations of neurotechnology. As reconstruction methods aim to recover visual experiences from brain activity and achieve prediction beyond training samples (zero-shot prediction), it is crucial to assess their capabilities and limitations to inform public expectations and regulations. Our case study of recent text-guided reconstruction methods, which leverage a large-scale dataset (Natural Scenes Dataset, NSD) and text-to-image diffusion models, reveals critical limitations in their generalizability, demonstrated by poor reconstructions on a different dataset. UMAP visualization of the text features from NSD images shows limited diversity with overlapping semantic and visual clusters between training and test sets. We identify that clustered training samples can lead to "output dimension collapse," restricting predictable output feature dimensions. While diverse training data improves generalization over the entire feature space without requiring exponential scaling, text features alone prove insufficient for mapping to the visual space. Our findings suggest that the apparent realism in current text-guided reconstructions stems from a combination of classification into trained categories and inauthentic image generation (hallucination) through diffusion models, rather than genuine visual reconstruction. We argue that careful selection of datasets and target features, coupled with rigorous evaluation methods, is essential for achieving authentic visual image reconstruction. These insights underscore the importance of grounding interdisciplinary discussions in a thorough understanding of the technology's current capabilities and limitations to ensure responsible development.
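To make the "output dimension collapse" argument concrete, here is a minimal numpy sketch (not the authors' code; synthetic data, with ridge regression standing in for the linear decoders typical of this literature and a CLIP-like 512-dimensional text feature space as an assumed target). Because every ridge prediction is a linear combination of the training target vectors, a training set whose targets span only a low-dimensional subspace forces all predictions into that subspace, no matter what test input arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, d_in, d_out, k = 200, 50, 512, 5  # k: effective diversity of training targets

# Training targets confined to a k-dimensional subspace of the feature space,
# a stand-in for clustered, low-diversity text features of training images.
basis = rng.standard_normal((k, d_out))
Y_train = rng.standard_normal((n_train, k)) @ basis   # (200, 512), rank k
X_train = rng.standard_normal((n_train, d_in))        # simulated brain activity

# Ridge decoder: W = (X^T X + lam*I)^{-1} X^T Y, so W = M @ Y_train for some M,
# meaning the rows of W (and hence all predictions) lie in the span of Y_train.
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d_in),
                    X_train.T @ Y_train)

X_test = rng.standard_normal((1000, d_in))            # arbitrary novel inputs
Y_pred = X_test @ W

# Rank of predictions collapses to k, not d_out: output dimension collapse.
print("rank of predictions:", np.linalg.matrix_rank(Y_pred))  # -> 5

# Predictions have essentially no component outside the training-target subspace.
Q, _ = np.linalg.qr(basis.T)                          # orthonormal basis, (512, 5)
residual = Y_pred - (Y_pred @ Q) @ Q.T
print("relative residual outside training span:",
      np.linalg.norm(residual) / np.linalg.norm(Y_pred))      # ~1e-15
```

Under these assumptions the collapse follows from linear algebra alone, which is why a clustered training set caps the predictable output dimensions of a linear decoder, and why increasing the diversity (rank) of the training targets expands coverage of the feature space without requiring exponentially many samples.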