ICube, University of Strasbourg, CNRS, France.
Med Image Anal. 2022 May;78:102433. doi: 10.1016/j.media.2022.102433. Epub 2022 Mar 26.
Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as 〈instrument, verb, target〉 combinations, is highly challenging to identify accurately. Not only can the triplet components be difficult to recognize individually; the task also requires recognizing all three components simultaneously and correctly establishing the data associations between them. To tackle this task, we introduce a new model, Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention, the Class Activation Guided Attention Mechanism (CAGAM), to capture individual action triplet components in a scene; this technique guides the recognition of verbs and targets using activations obtained from the instruments. To solve the association problem, the RDV model adds a new form of semantic attention inspired by Transformer networks, the Multi-Head of Mixed Attention (MHMA), which uses several cross- and self-attention heads to effectively capture the relationships between instruments, verbs, and targets. We also introduce CholecT50, a dataset of 50 endoscopic videos in which every frame is annotated with labels from 100 triplet classes. On this dataset, the proposed RDV model improves triplet prediction mAP by over 9% compared with state-of-the-art methods.
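The mixing of cross and self attention described for MHMA can be illustrated with a minimal sketch: each component stream (instrument, verb, target) queries a shared key/value bank built from all three streams, so every head attends both to its own features (self attention) and to the other components' features (cross attention). This is an illustrative NumPy toy, not the paper's implementation; the function names, feature shapes, and the single-head, unprojected formulation are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(q, k, v):
    # q: (n_q, d); k, v: (n_k, d). Standard scaled dot-product attention.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def mixed_attention(feat_inst, feat_verb, feat_targ):
    # Hypothetical "mixed" attention: stack all three component streams
    # into one key/value bank, then let each stream query the bank.
    # Rows of the bank from a query's own stream give self attention;
    # rows from the other streams give cross attention.
    bank = np.concatenate([feat_inst, feat_verb, feat_targ], axis=0)
    return (scaled_dot_attention(feat_inst, bank, bank),
            scaled_dot_attention(feat_verb, bank, bank),
            scaled_dot_attention(feat_targ, bank, bank))
```

A real multi-head variant would add learned query/key/value projections per head and concatenate head outputs, but the core idea sketched here is the same: association between triplet components emerges from attention weights computed across the concatenated streams.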