Zero-shot prompt-based video encoder for surgical gesture recognition.

Authors

Rao Mingxing, Qin Yinhong, Kolouri Soheil, Wu Jie Ying, Moyer Daniel

Affiliation

Department of Computer Science, Vanderbilt University, Nashville, USA.

Publication information

Int J Comput Assist Radiol Surg. 2025 Feb;20(2):311-321. doi: 10.1007/s11548-024-03257-1. Epub 2024 Sep 17.

Abstract

PURPOSE

To produce a surgical gesture recognition system that can support a wide variety of procedures, either a very large annotated dataset must be acquired, or fitted models must generalize to new labels (so-called zero-shot capability). In this paper we investigate the feasibility of the latter option.

METHODS

Leveraging the Bridge-Prompt framework, we prompt-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can draw on extensive outside data, such as text, in addition to video, and it also makes use of label metadata and weakly supervised contrastive losses.
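As a rough illustration of what a contrastive loss between video and text embeddings looks like, the sketch below implements a generic CLIP-style symmetric loss in NumPy. It is not the loss used in this paper or in Bridge-Prompt; the function names, the temperature value, and the random embeddings are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Row i of video_emb is assumed to be paired with row i of text_emb.
    # The temperature of 0.07 is an illustrative choice, not a value from the paper.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_video_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_video = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_video_to_text + loss_text_to_video)

# Hypothetical embeddings: 4 clips and 4 label descriptions, 8 dimensions each.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
matched_loss = clip_style_contrastive_loss(video, video)
mismatched_loss = clip_style_contrastive_loss(video, video[::-1])
```

Correctly paired embeddings should yield a lower loss than deliberately mismatched ones, which is the signal that pulls a clip toward the text describing its gesture during tuning.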

RESULTS

Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures or tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature extractor training schema.
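The zero-shot setting can be pictured with a minimal sketch: once video and text share an embedding space, a clip can be assigned to a label that never appeared in training by picking the label description with the highest cosine similarity. The label strings and the three-dimensional toy embeddings below are hypothetical and stand in for the outputs of real text and video encoders; this is not the authors' evaluation pipeline.

```python
import numpy as np

def zero_shot_predict(clip_emb, label_emb):
    # Return, for each clip, the index of the most similar label embedding.
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return (c @ l.T).argmax(axis=1)

# Hypothetical label descriptions; the last one is treated as unseen in training.
labels = ["reaching for needle", "pushing needle through tissue", "pulling suture"]
label_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Hypothetical clip embeddings from the video encoder.
clip_emb = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.95],
])

pred = zero_shot_predict(clip_emb, label_emb)
predicted_labels = [labels[i] for i in pred]
```

Adding a new gesture in this scheme only requires embedding its text description, which is why no gesture-specific retraining of the encoder is needed.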

CONCLUSION

Bridge-Prompt and similar pre-trained, prompt-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task-specific (gesture-specific) retraining, makes them invaluable.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c7e1/11807915/2730b56b2774/11548_2024_3257_Fig1_HTML.jpg
