Li Xinchen, Hong Yuan, Xu Yang, Hu Mu
Department of Orthopedics, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Diagnostics (Basel). 2024 Aug 25;14(17):1859. doi: 10.3390/diagnostics14171859.
The accurate and efficient segmentation of the spine is important in the diagnosis and treatment of spinal disorders and fractures. However, this task remains challenging because of large inter-vertebral variations in shape and the varying location of the spine across images. In previous methods, convolutional neural networks (CNNs) have been widely applied as the vision backbone for this task. However, because of the inherent locality of the convolution operation, these methods struggle to exploit global contextual information across the whole image for accurate spine segmentation. Compared with CNNs, the Vision Transformer (ViT) offers an alternative vision backbone with a strong capacity to capture global contextual information. However, when the ViT is employed for spine segmentation, it treats all input tokens equally, whether or not they relate to vertebrae, and it lacks the capability to locate regions of interest, which lowers segmentation accuracy. To address these limitations, we propose a novel Vertebrae-aware Vision Transformer (VerFormer) for automatic spine segmentation from CT images. VerFormer incorporates a novel Vertebrae-aware Global (VG) block into the ViT backbone. In the VG block, vertebrae-related global contextual information is extracted by a Vertebrae-aware Global Query (VGQ) module and then injected into the query tokens to highlight vertebrae-related tokens in the multi-head self-attention module. The VG block can therefore leverage global contextual information to locate the spine effectively and efficiently across the whole input, improving segmentation accuracy. Driven by this design, VerFormer captures more discriminative dependencies and vertebrae-related context in automatic spine segmentation. Experimental results on two spine CT segmentation tasks demonstrate the effectiveness of the VG block and the superiority of VerFormer: compared with other popular CNN- or ViT-based segmentation models, VerFormer achieves higher segmentation accuracy and better generalization.
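The abstract does not include implementation details, but the following PyTorch sketch illustrates one plausible reading of the VG/VGQ idea: a global descriptor of the token sequence is extracted, gated toward vertebrae-relevant channels, and added to the query tokens before standard multi-head self-attention. The module names, the pooling-plus-MLP form of the VGQ, and all hyperparameters are assumptions for illustration, not the authors' published code.

```python
import torch
import torch.nn as nn


class VertebraeAwareAttention(nn.Module):
    """Hypothetical sketch of a vertebrae-aware multi-head self-attention block.

    Per the abstract, a Vertebrae-aware Global Query (VGQ) module extracts
    vertebrae-related global context that is then incorporated into the query
    tokens. The concrete design below (global average pooling over tokens plus
    a gating MLP) is an assumption, not the paper's implementation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Assumed VGQ: gate a pooled global descriptor through an MLP so that
        # vertebrae-related channels dominate the query bias.
        self.vgq = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Global context vector shared by all tokens (assumed design),
        # added to the queries to highlight vertebrae-related tokens.
        g = x.mean(dim=1, keepdim=True)   # (B, 1, C)
        q = q + self.vgq(g) * g           # broadcast over the token axis

        # Standard multi-head scaled dot-product attention.
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    # Toy usage: 196 patch tokens of width 256, as in a small ViT stage.
    tokens = torch.randn(2, 196, 256)
    block = VertebraeAwareAttention(dim=256)
    print(block(tokens).shape)  # torch.Size([2, 196, 256])
```

The design choice sketched here keeps the attention itself unchanged and only biases the queries, which matches the abstract's claim that the VG block highlights vertebrae-related tokens inside an otherwise standard multi-head self-attention module.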