Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.
Institutes of Innovation for Future Society, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8601, Japan.
Int J Comput Assist Radiol Surg. 2024 Jun;19(6):1075-1083. doi: 10.1007/s11548-024-03101-6. Epub 2024 Apr 1.
Purpose: Surgical workflow recognition is a challenging task that requires understanding multiple aspects of surgery, such as gestures, phases, and steps. However, most existing methods focus on single-task or single-modal models and rely on costly annotations for training. To address these limitations, we propose a novel semi-supervised learning approach that leverages multimodal data and self-supervision to create meaningful representations for various surgical tasks.
Methods: Our representation learning approach proceeds in two stages. In the first stage, time contrastive learning is used to learn spatiotemporal visual features from video data without any labels. In the second stage, a multimodal variational autoencoder (VAE) fuses the visual features with kinematic data to obtain a shared representation, which is fed into recurrent neural networks for online recognition.
Results: Our method is evaluated on two datasets, JIGSAWS and MISAW. It achieves comparable or better performance in multi-granularity workflow recognition than fully supervised models specialized for each task. On the JIGSAWS Suturing dataset, it reaches a gesture recognition accuracy of 83.3%. On the MISAW dataset, it achieves 84.0% AD-Accuracy in phase recognition and 56.8% AD-Accuracy in step recognition. The model is also more annotation-efficient: it maintains high performance with only half of the labels.
Conclusion: Our multimodal representation is versatile across various surgical tasks and improves annotation efficiency. This work has significant implications for real-time decision-making systems in the operating room.
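As a rough illustration of the first stage, the sketch below implements a generic time-contrastive objective in PyTorch: frames sampled close in time form positive pairs, and temporally distant frames serve as negatives. The encoder architecture, the triplet-style loss, and the names (`VideoEncoder`, `time_contrastive_loss`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of stage 1: time-contrastive learning on unlabeled video.
# Nearby frames are pulled together in embedding space; distant frames are pushed apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Illustrative CNN mapping a frame to a unit-norm embedding (not the paper's backbone)."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(x), dim=-1)

def time_contrastive_loss(anchor, positive, negative, margin: float = 0.5):
    """Triplet loss over time: anchor/positive are frames within a small temporal
    window; the negative is sampled far away in the same video."""
    d_pos = (anchor - positive).pow(2).sum(-1)
    d_neg = (anchor - negative).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage on a dummy batch: dim 1 holds the (anchor, positive, negative) frames.
enc = VideoEncoder()
frames = torch.randn(8, 3, 3, 64, 64)  # (batch, triplet slot, C, H, W)
z_a, z_p, z_n = (enc(frames[:, i]) for i in range(3))
loss = time_contrastive_loss(z_a, z_p, z_n)
loss.backward()
```

Because the supervision signal comes purely from temporal proximity, this stage requires no workflow annotations, which is what makes the downstream semi-supervised setup annotation-efficient.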
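A similarly minimal sketch of the second stage follows, assuming fusion by concatenation of the two modality streams and a single-layer GRU for the causal (online) recognizer. All dimensions, including the 76 kinematic variables per frame as provided by JIGSAWS, and the class count are assumptions rather than the paper's exact design.

```python
# Minimal sketch of stage 2: a multimodal VAE fuses frozen visual features with
# kinematic signals into a shared latent; a GRU consumes the latents causally,
# so recognition can run online, one timestep at a time.
import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    def __init__(self, vis_dim=128, kin_dim=76, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(vis_dim + kin_dim, 2 * latent_dim)  # -> (mu, logvar)
        self.dec_vis = nn.Linear(latent_dim, vis_dim)             # reconstruct vision
        self.dec_kin = nn.Linear(latent_dim, kin_dim)             # reconstruct kinematics

    def forward(self, vis, kin):
        mu, logvar = self.enc(torch.cat([vis, kin], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        return z, self.dec_vis(z), self.dec_kin(z), mu, logvar

class OnlineRecognizer(nn.Module):
    """GRU over the shared latents with a per-timestep classification head."""
    def __init__(self, latent_dim=32, hidden=64, n_classes=10):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, z_seq, h=None):
        out, h = self.rnn(z_seq, h)
        return self.head(out), h  # logits per timestep, plus state for streaming

# Fuse per timestep, then recognize the whole sequence.
vae, rec = MultimodalVAE(), OnlineRecognizer()
vis = torch.randn(4, 100, 128)   # stage-1 visual features: (batch, time, dim)
kin = torch.randn(4, 100, 76)    # kinematic variables, e.g. 76 per frame in JIGSAWS
z, *_ = vae(vis, kin)
logits, _ = rec(z)               # (4, 100, n_classes)
```

The same shared latent `z` can feed separate recognition heads for gestures, phases, and steps, which is one plausible way a single representation serves multiple granularities of workflow recognition.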