School of Medicine, Guizhou University, Guiyang, China.
School of Stomatology, ZunYi Medical University, Zunyi, China.
BMC Med Inform Decis Mak. 2023 Feb 14;23(1):33. doi: 10.1186/s12911-023-02129-z.
Semantic segmentation of brain tumors plays a critical role in clinical treatment, especially for three-dimensional (3D) magnetic resonance imaging, which is widely used in clinical practice. Automatic segmentation of the 3D structure of a brain tumor helps physicians quickly understand its properties, such as shape and size, thus improving the efficiency of preoperative planning and the odds of successful surgery. In past decades, 3D convolutional neural networks (CNNs) have dominated automatic segmentation of 3D medical images, and these architectures have achieved good results. However, to limit the number of network parameters, practitioners generally keep the kernel size of 3D convolutions no larger than [Formula: see text], which restricts a CNN's ability to learn long-distance dependency information. The Vision Transformer (ViT) excels at learning long-distance dependencies in images, but it carries a large number of parameters. Worse, when training data are insufficient, ViT cannot learn local dependency information in its early layers. In image segmentation, however, learning this local dependency information in the early layers has a large impact on model performance.
This paper proposes the Swin Unet3D model, which formulates voxel segmentation of medical images as a sequence-to-sequence prediction. The feature-extraction sub-module is designed as a parallel structure of convolution and ViT, so that every layer of the model can adequately learn both global and local dependency information in the image.
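The parallel convolution/ViT structure described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration of the general idea, not the authors' exact design: the class name, the use of full (non-windowed) multi-head self-attention in place of a Swin-style windowed attention, and the concatenation-plus-1×1×1-convolution fusion are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ParallelConvViTBlock(nn.Module):
    """Hypothetical feature-extraction block: a local 3D-convolution branch
    and a global self-attention branch run in parallel, then their outputs
    are fused, so one layer sees both local and long-distance dependencies."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: small-kernel 3D convolution captures local dependencies.
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Global branch: multi-head self-attention over the flattened voxels
        # (a stand-in for the windowed attention used by Swin-style models).
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fusion: concatenate the two branches, mix with a 1x1x1 convolution.
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, depth, height, width)
        local = self.conv(x)
        b, c, d, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, D*H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        global_ = attn_out.transpose(1, 2).reshape(b, c, d, h, w)
        return self.fuse(torch.cat([local, global_], dim=1))
```

A block like this keeps the output shape equal to the input shape, so it can be stacked inside a U-Net-style encoder/decoder; for example, `ParallelConvViTBlock(8)` maps a `(1, 8, 4, 4, 4)` volume to another `(1, 8, 4, 4, 4)` volume.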
On the Brats2021 validation dataset, our proposed model achieves Dice coefficients of 0.840, 0.874, and 0.911 on the ET, TC, and WT channels, respectively. On the Brats2018 validation dataset, it achieves Dice coefficients of 0.716, 0.761, and 0.874 on the corresponding channels.
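The Dice coefficient reported above measures overlap between a predicted mask and the ground truth: 2|A∩B| / (|A| + |B|). A minimal stdlib-only sketch for binary masks (the function name and the smoothing term `eps` are illustrative choices, not from the paper):

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two flat binary masks (0/1 values).

    pred, target: equal-length sequences of 0s and 1s.
    eps: small smoothing term so the score is defined when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

# Example: 2 overlapping voxels, 3 predicted, 3 in the ground truth
pred   = [1, 1, 0, 1, 0, 0]
target = [1, 0, 0, 1, 1, 0]
score = dice_coefficient(pred, target)  # 2*2 / (3+3) ≈ 0.667
```

In the BraTS evaluations, the coefficient is computed per channel (ET, TC, WT) over the whole 3D volume; a score of 1.0 means perfect overlap.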
We propose a new segmentation model that combines the advantages of the Vision Transformer and convolution, achieving a better balance between the number of model parameters and segmentation accuracy. The code can be found at https://github.com/1152545264/SwinUnet3D .