Twinanda Andru P, Alkan Emre O, Gangi Afshin, de Mathelin Michel, Padoy Nicolas
ICube Laboratory, University of Strasbourg, CNRS, IHU Strasbourg, Strasbourg, France.
Int J Comput Assist Radiol Surg. 2015 Jun;10(6):737-47. doi: 10.1007/s11548-015-1186-1. Epub 2015 Apr 7.
Context-aware systems for the operating room (OR) offer the possibility to significantly improve surgical workflow through applications such as efficient OR scheduling, context-sensitive user interfaces, and automatic transcription of medical procedures. As an essential element of such systems, surgical action recognition is an important research area. In this paper, we tackle the problem of classifying surgical actions from video clips that capture the activities taking place in the OR.
We acquire recordings using a multi-view RGBD camera system mounted on the ceiling of a hybrid OR dedicated to X-ray-based procedures and annotate clips of the recordings with the corresponding actions. To recognize the surgical actions from the video clips, we use a classification pipeline based on the bag-of-words (BoW) approach. We propose a novel feature encoding method that extends the classical BoW approach. Instead of using the typical rigid grid layout to divide the space of the feature locations, we propose to learn the layout from the actual 4D spatio-temporal locations of the visual features. This results in a data-driven, non-rigid layout that retains more spatio-temporal information than its rigid counterpart.
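As a rough illustration of this encoding idea, the sketch below (our own, not the authors' code) contrasts rigid-grid pooling with a layout learned from the data. All names and parameters here are assumptions for exposition; in particular, we assume k-means serves as both the visual vocabulary and the layout learner, whereas the paper's exact models may differ.

```python
# Minimal sketch of data-driven, non-rigid spatio-temporal feature encoding.
# Assumption: k-means for both the visual vocabulary and the learned layout.
import numpy as np
from sklearn.cluster import KMeans

def fit_models(train_descriptors, train_locations, n_words=256, n_cells=8):
    """Fit the visual vocabulary on local descriptors (standard BoW step) and
    the non-rigid layout on their 4D (x, y, z, t) locations (data-driven step)."""
    vocab = KMeans(n_clusters=n_words, n_init=10).fit(train_descriptors)
    layout = KMeans(n_clusters=n_cells, n_init=10).fit(train_locations)
    return vocab, layout

def encode_clip(descriptors, locations, vocab, layout):
    """Build one BoW histogram per learned layout cell and concatenate them,
    replacing the rigid grid's fixed cells with data-driven ones."""
    words = vocab.predict(descriptors)    # visual word index per local feature
    cells = layout.predict(locations)     # learned, non-rigid cell per feature
    hist = np.zeros((layout.n_clusters, vocab.n_clusters))
    np.add.at(hist, (cells, words), 1.0)  # accumulate word counts per cell
    hist /= max(hist.sum(), 1.0)          # L1-normalize the joint histogram
    return hist.ravel()                   # (n_cells * n_words,) clip vector
```

A rigid baseline would replace `layout.predict` with a fixed quantization of (x, y, z, t) into grid cells; learning the cells instead lets the bins follow where the features actually occur in the OR.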
We classify multi-view video clips from a new dataset generated from 11 days of recordings of real operations. The dataset comprises 1734 video clips of 15 actions, including generic actions (e.g., moving the patient to the OR bed) and actions specific to the vertebroplasty procedure (e.g., hammering). The experiments show that the proposed non-rigid feature encoding method outperforms the rigid one, increasing the classifier's accuracy by over 4 percentage points, from 81.08 to 85.53%.
Combining the intensity and depth information from the RGBD data provides more discriminative power for the surgical action recognition task than using either modality alone. Furthermore, the proposed non-rigid spatio-temporal feature encoding scheme yields more discriminative histogram representations than its rigid counterpart. To the best of our knowledge, this is also the first work to present action recognition results on multi-view RGBD data recorded in the OR.
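One simple way to realize the intensity-depth combination is to encode each clip separately from intensity-based and depth-based features and concatenate the resulting histograms before a standard classifier. The abstract does not specify the exact fusion scheme, so the following is a hypothetical sketch, including the linear-SVM choice.

```python
# Hypothetical late-fusion sketch: the exact scheme and classifier are
# assumptions on our part, not confirmed by the abstract.
import numpy as np
from sklearn.svm import SVC

def fuse(hist_intensity, hist_depth):
    """Concatenate per-modality clip histograms into one fused representation."""
    return np.concatenate([hist_intensity, hist_depth])

# Illustrative usage: X_intensity/X_depth are lists of per-clip histograms,
# y holds the action labels of the clips.
# X = np.stack([fuse(hi, hd) for hi, hd in zip(X_intensity, X_depth)])
# clf = SVC(kernel="linear").fit(X, y)
```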