Pan Xiaoying, Gao Xuanrong, Wang Hongyu, Zhang Wuxia, Mu Yuanzhen, He Xianli
School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, GuoDu, Xi'an, 710121, Shaanxi, China.
Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, Xi'an, 710121, Shaanxi, China.
Int J Comput Assist Radiol Surg. 2023 Jan;18(1):139-147. doi: 10.1007/s11548-022-02785-y. Epub 2022 Nov 4.
Surgical workflow recognition has emerged as an important part of computer-assisted intervention systems for the modern operating room, but it remains a very challenging problem. Although CNN-based approaches achieve excellent performance, they do not learn global and long-range semantic interactions well because of the inductive bias inherent in convolution.
In this paper, we propose a temporal-based Swin Transformer network (TSTNet) for the surgical video workflow recognition task. TSTNet contains two main parts: the Swin Transformer and the LSTM. The Swin Transformer incorporates the attention mechanism to encode long-range dependencies and learn highly expressive representations. The LSTM is likewise capable of learning long-range dependencies and is used to extract temporal information. TSTNet organically combines the two components to extract spatiotemporal features that contain more contextual information. In particular, based on a full understanding of the natural characteristics of surgical video, we propose a priori revision algorithm (PRA) that uses prior information about the order of surgical phases. This strategy optimizes the output of TSTNet and further improves recognition performance.
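The abstract does not spell out how the PRA revises predictions, but the idea of using the known phase order to correct noisy per-frame outputs can be sketched as follows. This is a hypothetical minimal version, assuming phases are indexed 0-6 (as in Cholec80) and progress in a roughly monotone order; the function name, the sliding-window majority vote, and the backward-jump rejection rule are illustrative assumptions, not the paper's actual algorithm.

```python
def revise_with_prior(preds, window=2):
    """Revise per-frame phase predictions using a phase-order prior.

    preds: list of int phase indices (e.g. 0..6 for Cholec80), one per frame.
    Each frame's label is replaced by the majority label in a local window,
    and revisions that jump backward in the phase order are rejected in
    favor of the previously accepted label (assumed monotone progression).
    """
    revised = []
    for i in range(len(preds)):
        lo = max(0, i - window)
        hi = min(len(preds), i + window + 1)
        neighborhood = preds[lo:hi]
        # Local majority vote suppresses isolated misclassified frames.
        majority = max(set(neighborhood), key=neighborhood.count)
        prev = revised[-1] if revised else majority
        # Reject backward phase jumps under the monotone-order assumption.
        revised.append(majority if majority >= prev else prev)
    return revised
```

For example, an isolated spike back to an earlier phase inside a stable run is smoothed away, while a genuine, sustained transition to the next phase is kept.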
We conduct extensive experiments on the Cholec80 dataset to validate the effectiveness of the TSTNet-PRA method. Our method achieves excellent performance on Cholec80, reaching an accuracy of 92.8% and greatly exceeding state-of-the-art methods.
By modelling long-range temporal information and multi-scale visual information, we propose the TSTNet-PRA method. Evaluated on a large public dataset, it shows a high recognition capability superior to other spatiotemporal networks.