DIA2M, DRCI, CHU Clermont-Ferrand, Clermont-Ferrand, France.
Int J Comput Assist Radiol Surg. 2024 Jun;19(6):1157-1163. doi: 10.1007/s11548-024-03125-y. Epub 2024 Apr 12.
We investigate whether foundation models pretrained on diverse visual data could be beneficial to surgical computer vision. We use instrument and uterus segmentation in minimally invasive procedures as benchmarks. We propose multiple supervised, unsupervised and few-shot supervised adaptations of foundation models, including two novel adaptation methods.
We use DINOv1, DINOv2, DINOv2 with registers, and SAM backbones, with the ART-Net surgical instrument and the SurgAI3.8K uterus segmentation datasets. We investigate five approaches: DINO unsupervised, few-shot learning with a linear decoder, supervised learning with the proposed DINO-UNet adaptation, DPT with DINO encoder, and unsupervised learning with the proposed SAM adaptation.
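As a hedged illustration of the few-shot linear-decoder approach (a sketch, not the authors' implementation): patch tokens from a frozen self-supervised ViT encoder are classified per patch by a single linear layer, and the resulting low-resolution logit map is bilinearly upsampled to the image grid. Random tensors stand in here for DINOv2 ViT-S/14 patch embeddings (384-dimensional tokens on a 16x16 grid for a 224x224 input); the encoder call itself is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed geometry: DINOv2 ViT-S/14 yields 384-dim patch tokens;
# a 224x224 input with patch size 14 gives a 16x16 token grid.
EMB_DIM, GRID, N_CLASSES = 384, 16, 2

class LinearDecoder(nn.Module):
    """Per-patch linear classifier on frozen encoder tokens (few-shot head)."""
    def __init__(self, emb_dim: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, tokens: torch.Tensor, out_hw) -> torch.Tensor:
        # tokens: (B, GRID*GRID, EMB_DIM); classify each patch independently.
        b, n, _ = tokens.shape
        g = int(n ** 0.5)
        logits = self.head(tokens)                            # (B, N, C)
        logits = logits.permute(0, 2, 1).reshape(b, -1, g, g)  # (B, C, G, G)
        # Bilinear upsampling recovers full-resolution segmentation logits.
        return F.interpolate(logits, size=out_hw, mode="bilinear",
                             align_corners=False)

# Few-shot training sketch: only the linear head is optimised; the frozen
# encoder is simulated by fixed random tokens (hypothetical stand-in data).
torch.manual_seed(0)
decoder = LinearDecoder(EMB_DIM, N_CLASSES)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
tokens = torch.randn(4, GRID * GRID, EMB_DIM)       # stand-in frozen features
masks = torch.randint(0, N_CLASSES, (4, 224, 224))  # stand-in few-shot masks

for _ in range(5):
    opt.zero_grad()
    logits = decoder(tokens, (224, 224))
    loss = F.cross_entropy(logits, masks)
    loss.backward()
    opt.step()

pred = logits.argmax(dim=1)
print(pred.shape)  # torch.Size([4, 224, 224])
```

The head has only EMB_DIM * N_CLASSES + N_CLASSES parameters (770 here), which is why it can be fitted from a handful of annotated frames.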
We evaluate 17 models for instrument segmentation and 7 models for uterus segmentation, and compare them with existing task-specific models. We show that the linear decoder can be learned with few shots. The unsupervised and linear decoder methods obtain slightly subpar results but could be useful in data-scarce settings. The unsupervised SAM model produces finer edges but has inconsistent outputs. However, DPT and DINO-UNet obtain strikingly good results, defining a new state of the art by outperforming the previous best by 5.6 and 4.1 pp in instrument segmentation and by 4.4 and 1.5 pp in uterus segmentation. Both methods achieve semantic and spatial precision, accurately segmenting intricate details.
Our results demonstrate the strong potential of DINO and SAM for surgical computer vision, indicating a promising role for visual foundation models in medical image analysis, particularly in scenarios with limited or complex data.