Department of Mechanical Engineering, The University of Tokyo, Tokyo, Japan.
Int J Comput Assist Radiol Surg. 2020 Aug;15(8):1257-1265. doi: 10.1007/s11548-020-02185-0. Epub 2020 May 22.
Manually generating training data for the semantic segmentation of medical images with deep neural networks is a time-consuming and error-prone task. In this paper, we investigate the effect of different levels of realism on the training of deep neural networks for the semantic segmentation of robotic instruments. We developed an interactive virtual-reality environment to generate synthetic images for robot-aided endoscopic surgery. In contrast to earlier work, we use physically based rendering for increased realism.
Using a virtual-reality simulator that replicates our robotic setup, we generated three synthetic image databases with increasing levels of realism: flat, basic, and realistic (the last using physically based rendering). Each database was used to train 20 instances of a UNet-based semantic-segmentation deep-learning model. The networks, trained exclusively on synthetic images, were evaluated on the segmentation of 160 endoscopic images of a phantom and compared using the Dwass-Steel-Critchlow-Fligner (DSCF) nonparametric test; a minimal sketch of this evaluation pipeline follows.
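The sketch below illustrates the shape of this evaluation, assuming per-image mIoU scores for each trained network instance. The `mean_iou` helper and the placeholder score arrays are hypothetical (the paper's data and exact averaging convention are not given in this abstract); `posthoc_dscf` from the scikit-posthocs package implements the DSCF all-pairs comparison used here.

```python
import numpy as np
import pandas as pd
import scikit_posthocs as sp

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes for a single image.

    `pred` and `target` are integer label maps of identical shape.
    Classes absent from both masks are skipped (assumed convention).
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Placeholder per-network mIoU scores, one per trained instance
# (20 instances per training condition in the paper); illustrative
# random values, NOT the paper's results.
rng = np.random.default_rng(0)
miou_flat = rng.uniform(0.15, 0.30, 20)
miou_basic = rng.uniform(0.40, 0.55, 20)
miou_realistic = rng.uniform(0.65, 0.80, 20)

scores = pd.DataFrame({
    "miou": np.concatenate([miou_flat, miou_basic, miou_realistic]),
    "condition": ["flat"] * 20 + ["basic"] * 20 + ["realistic"] * 20,
})

# All-pairs Dwass-Steel-Critchlow-Fligner comparison; returns a matrix
# of pairwise p-values between the three training conditions.
pvals = sp.posthoc_dscf(scores, val_col="miou", group_col="condition")
print(pvals)
```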
Our results show that increasing the level of realism significantly increased the mean intersection-over-union (mIoU) of the networks on endoscopic images of a phantom. The median mIoU values were 0.235 for the flat dataset, 0.458 for the basic dataset, and 0.729 for the realistic dataset. All networks trained with synthetic images outperformed naive classifiers. Moreover, in an ablation study, we show that physically based rendering yields a significantly higher mIoU than texture mapping applied to the instrument only (0.606), the background only (0.685), and the background and instruments combined (0.672).
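For reference, a standard definition of the mIoU metric reported above; the averaging convention (mean over the classes present in either mask) is an assumption, as the abstract does not specify it:

```latex
\[
\mathrm{IoU}_c = \frac{\lvert P_c \cap G_c \rvert}{\lvert P_c \cup G_c \rvert},
\qquad
\mathrm{mIoU} = \frac{1}{\lvert \mathcal{C} \rvert} \sum_{c \in \mathcal{C}} \mathrm{IoU}_c ,
\]
```

where \(P_c\) is the set of pixels predicted as class \(c\), \(G_c\) is the set of ground-truth pixels of class \(c\), and \(\mathcal{C}\) is the set of classes evaluated.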
Using physically based rendering to generate synthetic images is an effective approach to improve the training of neural networks for the semantic segmentation of surgical instruments in endoscopic images. Our results show that this strategy can be an essential step toward the broad applicability of deep neural networks to semantic-segmentation tasks and can help bridge the domain gap in machine learning.