Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA.
Int J Comput Assist Radiol Surg. 2021 May;16(5):779-787. doi: 10.1007/s11548-021-02343-y. Epub 2021 Mar 24.
Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.
We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder-decoder network that maps optical flow to the corresponding kinematics sequence. Clustering the latent representations reveals meaningful groupings by surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying gestures and skill on tasks not used for training.
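To make the cross-modal setup concrete, the sketch below pairs a small 3D-convolutional encoder for optical-flow clips with a GRU decoder that regresses the synchronized kinematics sequence, then clusters the resulting latent codes. All layer sizes, sequence lengths, and the specific encoder/decoder choices are illustrative assumptions rather than the architecture from the paper; only the 76-dimensional kinematics vector follows the JIGSAWS convention.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class FlowToKinematics(nn.Module):
    """Encode an optical-flow clip and decode the synchronous kinematics sequence."""
    def __init__(self, latent_dim=128, kin_dim=76, seq_len=30):
        super().__init__()
        self.seq_len = seq_len
        # Encoder: stacked optical-flow frames (2 channels: dx, dy) -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv3d(2, 16, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        # Decoder: unroll the latent vector over time into a kinematics sequence.
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.readout = nn.Linear(latent_dim, kin_dim)

    def forward(self, flow):
        # flow: (batch, 2, frames, height, width) optical-flow stack
        z = self.encoder(flow)                              # (batch, latent_dim)
        z_seq = z.unsqueeze(1).repeat(1, self.seq_len, 1)   # repeat latent over time
        h, _ = self.decoder(z_seq)
        return self.readout(h), z                           # predicted kinematics, latent code

# Self-supervised objective: regress the synchronized kinematics; no manual labels needed.
model = FlowToKinematics()
flow = torch.randn(8, 2, 16, 64, 64)   # toy batch of optical-flow clips
kin_target = torch.randn(8, 30, 76)    # synchronous kinematics readings (76-dim as in JIGSAWS)
kin_pred, latent = model(flow)
loss = nn.functional.mse_loss(kin_pred, kin_target)
loss.backward()

# Downstream: cluster the latent codes to look for gesture/skill groupings (illustrative).
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(latent.detach().numpy())
```

In the same spirit, the latent codes from such a model could feed a simple classifier to probe gesture and skill recognition on tasks not seen during training, as described in the evaluation.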
For tasks seen during training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we observe 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
From predicting the synchronous kinematics sequence, optical-flow representations of surgical scenes emerge that separate well even for new tasks the model has not seen before. While the representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may also enable research into lifelong and user-specific learning.