IEEE Trans Med Imaging. 2024 Aug;43(8):2803-2813. doi: 10.1109/TMI.2024.3381209. Epub 2024 Aug 1.
Instrument-tissue interaction detection, which helps in understanding surgical activities, is vital for building computer-assisted surgery systems, but it poses many challenges. First, most models represent instrument-tissue interaction in a coarse-grained way that focuses only on classification and cannot automatically detect instruments and tissues. Second, existing works do not fully consider intra- and inter-frame relationships between instruments and tissues. In this paper, we propose to represent an instrument-tissue interaction as an 〈instrument class, instrument bounding box, tissue class, tissue bounding box, action class〉 quintuple and present an Instrument-Tissue Interaction Detection Network (ITIDNet) that detects this quintuple for surgical video understanding. Specifically, we propose a Snippet Consecutive Feature (SCF) Layer that enhances features by modeling relationships among proposals in the current frame using global context information from the video snippet. We also propose a Spatial Corresponding Attention (SCA) Layer that incorporates proposal features between adjacent frames through spatial encoding. To reason about relationships between instruments and tissues, we propose a Temporal Graph (TG) Layer with intra-frame connections that exploit relationships between instruments and tissues within the same frame and inter-frame connections that model temporal information for the same instance. For evaluation, we build a cataract surgery video dataset (PhacoQ) and a cholecystectomy surgery video dataset (CholecQ). Experimental results demonstrate the promising performance of our model, which outperforms other state-of-the-art models on both datasets.
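The quintuple representation described in the abstract can be sketched as a simple data structure. This is an illustrative sketch only; the field names, class labels, and box convention are assumptions, not taken from the paper's code:

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) pixel coords

@dataclass
class InteractionQuintuple:
    """Hypothetical container for one detected instrument-tissue interaction:
    <instrument class, instrument box, tissue class, tissue box, action class>."""
    instrument_class: str
    instrument_box: Box
    tissue_class: str
    tissue_box: Box
    action_class: str

# Example with made-up labels and coordinates for illustration.
q = InteractionQuintuple(
    instrument_class="forceps",
    instrument_box=(10.0, 20.0, 50.0, 60.0),
    tissue_class="gallbladder",
    tissue_box=(30.0, 40.0, 90.0, 100.0),
    action_class="grasp",
)
print(q.action_class)  # grasp
```

In this view, a detector would emit one such quintuple per interaction per frame, so downstream evaluation can score localization (the two boxes) and classification (the three class labels) jointly.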