Wang Hongqiu, Yang Guang, Zhang Shichen, Qin Jing, Guo Yike, Xu Bo, Jin Yueming, Zhu Lei
IEEE Trans Med Imaging. 2024 Dec;43(12):4457-4469. doi: 10.1109/TMI.2024.3426953. Epub 2024 Dec 2.
Surgical instrument segmentation is fundamentally important for facilitating cognitive intelligence in robot-assisted surgery. Although existing methods achieve accurate instrument segmentation, they generate segmentation masks for all instruments simultaneously, and therefore cannot specify a target object or support an interactive experience. This paper focuses on a novel and essential task in robotic surgery, i.e., Referring Surgical Video Instrument Segmentation (RSVIS), which aims to automatically identify and segment the target surgical instrument in each video frame, referred to by a given language expression. This interactive capability offers enhanced user engagement and customized experiences, greatly benefiting the development of the next generation of surgical education systems. To this end, we construct two surgical video datasets to promote RSVIS research. We then devise a novel Video-Instrument Synergistic Network (VIS-Net) that learns both video-level and instrument-level knowledge to boost performance, whereas previous work utilized only video-level information. Meanwhile, we design a Graph-based Relation-aware Module (GRM) to model the correlation between multi-modal information (i.e., the textual description and the video frame), facilitating the extraction of instrument-level information. Extensive experimental results on the two RSVIS datasets demonstrate that VIS-Net significantly outperforms existing state-of-the-art referring segmentation methods. We will release our code and dataset for future research (https://github.com/whq-xxh/RSVIS).
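The abstract does not detail the internals of the Graph-based Relation-aware Module, but the general idea of modeling cross-modal correlation can be sketched as a graph in which visual patches and word tokens are nodes and attention weights act as cross-modal edge strengths. The sketch below is an illustration under that assumption, not the authors' implementation; the function name, projection matrices, and dimensions are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_relation_module(vis_feats, txt_feats, d_k=64):
    """Hypothetical sketch of a graph-based relation-aware module:
    visual patches and word embeddings are graph nodes; cross-modal
    attention weights serve as edge strengths between the two sets.
    In a trained network the projections would be learned; here they
    are fixed random matrices for illustration only."""
    rng = np.random.default_rng(0)
    d = vis_feats.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)  # visual query proj.
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)  # textual key proj.
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)  # textual value proj.
    q = vis_feats @ Wq                 # visual nodes query ...
    k = txt_feats @ Wk                 # ... textual nodes
    v = txt_feats @ Wv
    edges = softmax(q @ k.T / np.sqrt(d_k))  # cross-modal adjacency matrix
    return edges @ v                   # text-conditioned visual features

# Toy usage: 16 visual patches and 5 word tokens, both 32-dimensional.
vis = np.random.default_rng(1).standard_normal((16, 32))
txt = np.random.default_rng(2).standard_normal((5, 32))
out = graph_relation_module(vis, txt)
print(out.shape)  # one aggregated feature per visual node
```

Each row of `edges` sums to one, so every visual node receives a convex combination of the textual node features, which is one simple way such a module can inject the referring expression into per-instrument visual features.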