Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), Department of Computer Science, University College London, London, UK.
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, China.
Int J Comput Assist Radiol Surg. 2022 Dec;17(12):2193-2202. doi: 10.1007/s11548-022-02743-8. Epub 2022 Sep 21.
Real-time surgical workflow analysis has been a key component of computer-assisted intervention systems for improving cognitive assistance. Most existing methods rely solely on conventional temporal models and encode features in a successive spatial-temporal arrangement, so the supportive benefits of intermediate features are partially lost in both the visual and temporal aspects. In this paper, we rethink feature encoding to attend to and preserve the information critical for accurate workflow recognition and anticipation.
We introduce the Transformer into surgical workflow analysis to reconsider the complementary effects of spatial and temporal representations. We propose a hybrid embedding aggregation Transformer, named Trans-SVNet, that effectively fuses the designed spatial and temporal embeddings by employing the spatial embedding to query the temporal embedding sequence. The model is jointly optimized with loss objectives from both analysis tasks to leverage their high correlation.
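To make the query-based aggregation concrete, below is a minimal PyTorch sketch under our own assumptions about dimensions, layer counts, loss weighting, and names (HybridAggregationHead and all hyperparameters here are illustrative, not the authors' released implementation). The per-frame spatial embedding serves as the attention query, the temporal embedding sequence provides the keys and values, and the fused feature feeds both recognition and anticipation heads for joint optimization.

import torch
import torch.nn as nn

class HybridAggregationHead(nn.Module):
    """Sketch of hybrid embedding aggregation: the spatial embedding
    queries the temporal embedding sequence via Transformer attention."""

    def __init__(self, dim=512, num_heads=8, num_phases=7):
        super().__init__()
        # A TransformerDecoder layer already implements the pattern of
        # taking queries from one source (tgt) and keys/values from
        # another (memory).
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.aggregator = nn.TransformerDecoder(layer, num_layers=1)
        self.recognition_head = nn.Linear(dim, num_phases)  # phase logits
        self.anticipation_head = nn.Linear(dim, 1)          # e.g. remaining time

    def forward(self, spatial_emb, temporal_seq):
        # spatial_emb:  (B, 1, dim) embedding of the current frame
        # temporal_seq: (B, T, dim) embeddings of the recent clip
        fused = self.aggregator(tgt=spatial_emb, memory=temporal_seq)
        fused = fused.squeeze(1)                            # (B, dim)
        return self.recognition_head(fused), self.anticipation_head(fused)

# Joint optimization with losses from both tasks (the 0.5 weight and the
# random inputs/targets are placeholders for illustration only):
model = HybridAggregationHead()
rec_logits, ant_pred = model(torch.randn(2, 1, 512), torch.randn(2, 10, 512))
loss = nn.CrossEntropyLoss()(rec_logits, torch.tensor([0, 3])) \
       + 0.5 * nn.SmoothL1Loss()(ant_pred, torch.rand(2, 1))
loss.backward()

A single backward pass through the summed loss is one simple way to realize the joint optimization described above; the actual weighting between the recognition and anticipation objectives is an assumption here.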
We extensively evaluate our method on three large surgical video datasets. Our method consistently outperforms state-of-the-art approaches on the workflow recognition task across all three datasets. When jointly learned with the anticipation task, recognition results gain a large improvement. Our approach is also effective on anticipation, achieving promising performance. Our model runs in real time, with an inference speed of 0.0134 s per frame.
Experimental results demonstrate the efficacy of our hybrid embedding aggregation in rediscovering crucial cues from complementary spatial-temporal embeddings. The improved performance under multi-task learning indicates that the anticipation task brings additional knowledge to the recognition task. The effectiveness and efficiency of our method also indicate its promising potential for use in the operating room.