IEEE Trans Med Imaging. 2017 Jul;36(7):1542-1549. doi: 10.1109/TMI.2017.2665671. Epub 2017 Feb 8.
Understanding robot-assisted surgery (RAS) videos is an active research area, and modeling surgeons' gestures and skill levels is an interesting problem within it. The insights drawn may be applied to effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgery. We address the open problem of tool detection and localization in RAS video understanding using a strictly computer vision approach that builds on recent advances in deep learning. We propose an architecture based on multimodal convolutional neural networks for fast detection and localization of tools in RAS videos. To the best of our knowledge, this is the first approach to incorporate deep neural networks for tool detection and localization in RAS videos. The architecture applies a region proposal network (RPN) and a multimodal two-stream convolutional network for object detection to jointly predict objectness and localization on a fusion of image and temporal motion cues. Our results, an average precision of 91% and a mean computation time of 0.1 s per test frame, indicate that our method outperforms methods conventionally used in medical imaging and highlight the benefits of an RPN for precision and efficiency. We also introduce a new data set, ATLAS Dione, for RAS video understanding. The data set provides video of ten surgeons from Roswell Park Cancer Institute, Buffalo, NY, USA, performing six different surgical tasks on the daVinci Surgical System (dVSS), with per-frame annotations of the robotic tools.
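The abstract reports detection quality as average precision (AP) over per-frame tool bounding boxes. As a rough illustration of how such a figure is typically computed, the sketch below matches scored detections to annotated boxes by intersection over union (IoU) and integrates the precision-recall curve. The 0.5 IoU threshold, the all-points interpolation, and the names `iou` and `average_precision` are assumptions made for this example; the abstract does not state the exact evaluation protocol used.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), in pixels
Detection = Tuple[int, float, Box]       # (frame_id, confidence score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Detection],
    ground_truth: Dict[int, List[Box]],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the abstract
) -> float:
    """Area under the precision-recall curve (all-points interpolation)."""
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    if total_gt == 0:
        return 0.0

    # Each annotated box may be claimed by at most one detection.
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    tp_flags: List[bool] = []

    # Visit detections from most to least confident.
    for frame_id, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth.get(frame_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = (
            best_idx >= 0
            and best_iou >= iou_threshold
            and not matched[frame_id][best_idx]
        )
        if is_tp:
            matched[frame_id][best_idx] = True
        tp_flags.append(is_tp)

    # Precision and recall after each ranked detection.
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / total_gt)

    # Make precision non-increasing in recall, then sum the area in recall steps.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

With this convention, a detector that ranks every correct box above every false alarm and misses no annotated tool scores an AP of 1.0; false alarms ranked above correct boxes, and missed tools, both pull the value down.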