The Chinese University of Hong Kong, Hong Kong, China.
Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London, London, UK.
Int J Comput Assist Radiol Surg. 2024 Jun;19(6):1013-1020. doi: 10.1007/s11548-024-03083-5. Epub 2024 Mar 8.
Depth estimation in robotic surgery is vital for 3D reconstruction, surgical navigation, and augmented reality visualization. Although foundation models such as DINOv2 exhibit outstanding performance in many vision tasks, including depth estimation, recent works have observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of a foundation model for surgical depth estimation.
We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. Instead of conventional fine-tuning, we build LoRA layers and integrate them into DINOv2 to adapt it to surgery-specific domain knowledge. During training, we freeze the DINOv2 image encoder, which shows excellent visual representation capacity, and optimize only the LoRA layers and the depth decoder to integrate features from the surgical scene.
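The freeze-and-adapt idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy on a single linear projection as a stand-in for one projection inside a transformer encoder, and the class name `LoRALinear`, the `rank` and `alpha` values, the squared-error loss, and the plain gradient step are all assumptions made for illustration. The abstract does not state which encoder layers receive LoRA or how the depth decoder is built.

```python
import numpy as np


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration only; names and hyperparameters are assumptions.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # pretrained weight, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable up-projection, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        frozen = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return frozen + update

    def num_trainable(self):
        # Only A and B are optimized: rank * (d_in + d_out) values.
        return self.A.size + self.B.size

    def loss(self, x, target):
        # Squared error averaged over the batch (an assumed stand-in loss).
        diff = self.forward(x) - target
        return 0.5 * np.sum(diff ** 2) / x.shape[0]

    def sgd_step(self, x, target, lr=0.1):
        # Gradients flow only into A and B; self.weight is never touched.
        g = (self.forward(x) - target) / x.shape[0]        # dLoss/dOutput
        grad_B = self.scale * g.T @ (x @ self.A.T)         # shape (d_out, rank)
        grad_A = self.scale * (g @ self.B).T @ x           # shape (rank, d_in)
        self.A -= lr * grad_A
        self.B -= lr * grad_B


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    layer = LoRALinear(rng.normal(size=(6, 8)), rank=2, alpha=2.0)
    x, target = rng.normal(size=(5, 8)), rng.normal(size=(5, 6))
    layer.sgd_step(x, target)
    print("trainable values:", layer.num_trainable(), "frozen values:", layer.weight.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training moves it away from the pretrained behaviour only through the small matrices `A` and `B`.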
Our model is extensively validated on SCARED, a MICCAI challenge dataset collected from da Vinci Xi endoscopic surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. Ablation studies provide evidence of the remarkable effect of our LoRA layers and adaptation.
Surgical-DINO sheds some light on the successful adaptation of foundation models to the surgical domain for depth estimation. The results provide clear evidence that neither zero-shot prediction with weights pre-trained on computer vision datasets nor naive fine-tuning is sufficient to use a foundation model directly in the surgical domain.