Department of Electrical, Biomedical and Computer Engineering, Toronto Metropolitan University (formerly Ryerson University), 350 Victoria Street, Toronto, M5B 2K3, Ontario, Canada.
Department of Medical Imaging, University of Saskatchewan, 105 Administration Pl, Saskatoon, SK S7N0W8, Saskatchewan, Canada.
J Imaging Inform Med. 2024 Oct;37(5):2669-2687. doi: 10.1007/s10278-024-01108-8. Epub 2024 Apr 15.
Convolutional neural networks (CNNs) have been used for a wide variety of deep learning applications, especially in computer vision. For medical image processing, researchers have identified certain challenges associated with CNNs. These challenges include the generation of less informative features, limitations in capturing both high- and low-frequency information within feature maps, and the computational cost incurred when enlarging receptive fields by deepening the network. Transformers have emerged as an approach aiming to address and overcome these specific limitations of CNNs in the context of medical image analysis. Preserving all spatial details of medical images is necessary to ensure accurate patient diagnosis. Hence, this research introduces a denoising network built on a pure Vision Transformer (ViT) for medical image processing, applied specifically to low-dose computed tomography (LDCT) image denoising. The proposed model follows a U-Net framework containing ViT modules with an integrated Noise2Neighbor (N2N) interpolation operation. Five different datasets containing LDCT and normal-dose CT (NDCT) image pairs were used to carry out this experiment. To test the efficacy of the proposed model, the experiment compares quantitative and visual results among CNN-based (BM3D, RED-CNN, DRL-E-MP), hybrid CNN-ViT-based (TED-Net), and the proposed pure ViT-based denoising models. The findings of this study show an increase of about 15-20% in SSIM and PSNR when using self-attention transformers compared with typical pure CNNs. Visual results also show improvements, especially in rendering the fine structural details of CT images.
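The reported 15-20% gains are measured in SSIM and PSNR between each denoised output and its NDCT reference. As context, below is a minimal sketch of how these metrics can be computed for an image pair, assuming images scaled to a known data range (e.g. [0, 1]). Note that `ssim_global` here is a simplified variant using whole-image statistics; reported SSIM values are typically computed with local sliding windows (e.g. `skimage.metrics.structural_similarity`), so this is illustrative only.

```python
import numpy as np

def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference (e.g. NDCT)
    image and a test (e.g. denoised LDCT) image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((data_range ** 2) / mse)

def ssim_global(reference, test, data_range=1.0):
    """Simplified SSIM using global image statistics. The standard metric
    averages the same formula over local Gaussian-weighted windows."""
    x = reference.astype(np.float64)
    y = test.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

In an evaluation loop over the five LDCT/NDCT test sets, these scores would be averaged per dataset for each method (BM3D, RED-CNN, DRL-E-MP, TED-Net, and the proposed model) to produce the quantitative comparison.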