Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, 3200003, Haifa, Israel.
Applied Mathematics Department, Technion - Israel Institute of Technology, 3200003, Haifa, Israel.
Int J Comput Assist Radiol Surg. 2022 Aug;17(8):1497-1505. doi: 10.1007/s11548-022-02691-3. Epub 2022 Jun 27.
The goal of this work is to use multi-camera video to classify open surgery tools and to identify which tool is held in each hand. Multi-camera systems help prevent occlusions in open surgery video data. Furthermore, combining multiple views, such as a top-view camera covering the full operative field and a close-up camera focusing on hand motion and anatomy, may provide a more comprehensive view of the surgical workflow. However, multi-camera data fusion poses a new challenge: a tool may be visible in one camera but not the other. We therefore defined the global ground truth as the tools actually in use, regardless of their visibility. As a result, tools that leave the image must be remembered for extended periods of time, while the system still responds quickly to changes visible in the video.
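The "remember tools that leave the image" requirement can be illustrated with a minimal sketch of such a labeling scheme (the function name, label values, and event structure are illustrative assumptions, not the authors' annotation tooling): a hand's label is the tool last known to be in use, persisting across frames where the tool is occluded or out of view.

```python
# Hypothetical sketch of a "global ground truth" labeling rule: the label for
# a hand is the tool in use, persisting even when the tool is occluded or
# outside the camera frame. Names and structure are assumptions.
def global_labels(tool_changes, n_frames, initial="empty"):
    """tool_changes: {frame_index: tool} observed tool-change events.
    Returns a per-frame label list in which a tool, once picked up,
    is 'remembered' until the next observed change."""
    labels, current = [], initial
    for t in range(n_frames):
        current = tool_changes.get(t, current)  # update only on a change
        labels.append(current)                  # otherwise carry the label
    return labels

# Example: a tool picked up at frame 2 is still the label at frame 5,
# even if it left the field of view in between.
labels = global_labels({2: "needle_driver"}, n_frames=6)
print(labels)
# → ['empty', 'empty', 'needle_driver', 'needle_driver', 'needle_driver', 'needle_driver']
```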
Participants (n = 48) performed a simulated open bowel repair. A top-view camera and a close-up camera were used. YOLOv5 was used for tool and hand detection. A high-frequency LSTM with a 1-second window at 30 frames per second (fps) and a low-frequency LSTM with a 40-second window at 3 fps were used for spatial, temporal, and multi-camera integration.
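The two temporal scales described above can be sketched as a dual-rate model (a minimal PyTorch illustration under assumed feature and class dimensions; layer names, sizes, and the late-fusion head are our assumptions, not the authors' published architecture): a 1-second window at 30 fps gives 30 steps for the high-frequency LSTM, and a 40-second window at 3 fps gives 120 steps for the low-frequency LSTM, over per-frame detection features from both cameras.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the described dual-rate design; dimensions and
# fusion strategy are assumptions for illustration only.
class DualRateToolClassifier(nn.Module):
    """High-frequency LSTM (1 s window at 30 fps -> 30 steps) plus
    low-frequency LSTM (40 s window at 3 fps -> 120 steps) over
    per-frame detection features concatenated from two cameras."""

    def __init__(self, feat_dim=64, hidden=128, n_tools=4):
        super().__init__()
        # Each time step: detection features from both cameras, concatenated.
        self.high_lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.low_lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_tools)  # fuse both temporal scales

    def forward(self, high_seq, low_seq):
        # high_seq: (batch, 30, 2*feat_dim); low_seq: (batch, 120, 2*feat_dim)
        _, (h_high, _) = self.high_lstm(high_seq)
        _, (h_low, _) = self.low_lstm(low_seq)
        fused = torch.cat([h_high[-1], h_low[-1]], dim=-1)
        return self.head(fused)  # tool-class logits

model = DualRateToolClassifier()
high = torch.randn(2, 30, 128)   # 1-second window sampled at 30 fps
low = torch.randn(2, 120, 128)   # 40-second window sampled at 3 fps
logits = model(high, low)
print(logits.shape)  # torch.Size([2, 4])
```

Fusing the final hidden states of both LSTMs lets a single classifier weigh fast, visible changes against the longer memory needed for tools that have left the frame.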
The accuracy and F1 of the six systems were: top-view (0.88/0.88), close-up (0.81/0.83), both cameras (0.90/0.90), high-fps LSTM (0.92/0.93), low-fps LSTM (0.90/0.91), and our final architecture, the multi-camera classifier (0.93/0.94).
Since each camera in a multi-camera system may have only a partial view of the procedure, we defined a 'global ground truth.' Defining it at the data-labeling phase made this requirement explicit at the learning phase, eliminating the need for any heuristic decisions. By combining a high-fps system and a low-fps system over the multi-camera array, we improved classification against the global ground truth.