Masuda Naohiro, Ono Keiko, Tawara Daisuke, Matsuura Yusuke, Sakabe Kentaro
Master's Program in Information and Computer Science, Doshisha University, Kyoto 610-0394, Japan.
Department of Intelligent Information Engineering and Sciences, Doshisha University, Kyoto 610-0394, Japan.
Sensors (Basel). 2024 Dec 26;25(1):81. doi: 10.3390/s25010081.
The semantic segmentation of bone structures demands pixel-level classification accuracy to create reliable bone models for diagnosis. While Convolutional Neural Networks (CNNs) are commonly used for segmentation, they often struggle with complex shapes because they focus on texture features and have a limited ability to incorporate positional information. As orthopedic surgery increasingly requires precise automatic diagnosis, we explored SegFormer, an enhanced Vision Transformer model that better handles spatial awareness in segmentation tasks. However, SegFormer's effectiveness is typically limited by its need for extensive training data, which is particularly challenging in medical imaging, where obtaining labeled ground truths (GTs) is costly and resource-intensive. In this paper, we propose two improvements to SegFormer, and their combination, that enable accurate feature extraction from smaller datasets: a data-efficient model, which deepens the hierarchical encoder by adding convolution layers to the transformer blocks and increases the feature map resolution within them, and an FPN-based model, which enhances the decoder with a Feature Pyramid Network (FPN) and attention mechanisms. We tested our models on spine images from the Cancer Imaging Archive and on our own hand and wrist dataset; ablation studies confirmed that our modifications outperform the original SegFormer, U-Net, and Mask2Former. These enhancements enable better image feature extraction and more precise object contour detection, which is particularly beneficial for medical imaging applications with limited training data.
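The abstract does not include code, but the encoder modification it describes (adding convolution layers inside transformer blocks to inject local positional information) can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical illustration of the general technique, not the authors' implementation: the class name, the depthwise 3x3 placement, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ConvAugmentedTransformerBlock(nn.Module):
    """Sketch of a SegFormer-style transformer block with an added
    convolution stage. Layer sizes and placement are assumed, since the
    abstract does not specify them."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Added convolution: a depthwise 3x3 conv supplies the local
        # positional cues that plain self-attention lacks.
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) token sequence over an h x w feature map
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)
        x = x + attn_out
        # Reshape tokens to a 2-D map, convolve, and flatten back.
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.conv(feat).flatten(2).transpose(1, 2)
        x = x + self.mlp(self.norm2(x))
        return x

# Usage with made-up sizes: 2 images, a 32x32 feature map, 64 channels.
block = ConvAugmentedTransformerBlock(dim=64)
tokens = torch.randn(2, 32 * 32, 64)
out = block(tokens, h=32, w=32)  # shape: (2, 1024, 64)
```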
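The FPN-based decoder is likewise only named, not specified. The sketch below shows one plausible wiring, assuming a four-stage hierarchical encoder (strides 4 to 32) and a squeeze-and-excitation-style channel gate standing in for the unspecified attention mechanism; FPNAttentionDecoder, ChannelAttention, and all parameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate; an assumed stand-in for the
    attention mechanism mentioned in the abstract."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))  # (b, c) pooled channel descriptor
        return x * w[:, :, None, None]   # reweight channels

class FPNAttentionDecoder(nn.Module):
    """Hypothetical FPN decoder over multi-scale encoder feature maps."""

    def __init__(self, in_channels: list, out_channels: int, num_classes: int):
        super().__init__()
        # 1x1 lateral convs project every encoder stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.attn = nn.ModuleList(
            ChannelAttention(out_channels) for _ in in_channels
        )
        self.head = nn.Conv2d(out_channels, num_classes, kernel_size=1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: encoder maps ordered fine -> coarse, e.g. strides 4,8,16,32
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample each coarser map and add it in.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:],
                mode="bilinear", align_corners=False,
            )
        # Attention-gate the fused maps, then predict at the finest level.
        fused = [gate(x) for gate, x in zip(self.attn, laterals)]
        return self.head(fused[0])

# Usage with made-up encoder widths for a 256x256 input.
decoder = FPNAttentionDecoder([32, 64, 160, 256], out_channels=128, num_classes=2)
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip([32, 64, 160, 256], [4, 8, 16, 32])]
logits = decoder(feats)  # shape: (1, 2, 64, 64)
```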