IEEE Trans Med Imaging. 2022 Jun;41(6):1311-1319. doi: 10.1109/TMI.2021.3139023. Epub 2022 Jun 1.
Ultrasound imaging is a commonly used technology for visualising patient anatomy in real time during diagnostic and therapeutic procedures. High operator dependency and low reproducibility make ultrasound imaging and interpretation challenging, with a steep learning curve. Automatic image classification using deep learning has the potential to overcome some of these challenges by supporting ultrasound training in novices, as well as aiding ultrasound image interpretation in patients with complex pathology for more experienced practitioners. However, deep learning methods require a large amount of data in order to provide accurate results. Labelling large ultrasound datasets is a challenging task because labels are retrospectively assigned to 2D images without the 3D spatial context available in vivo, or that would be inferred while visually tracking structures between frames during the procedure. In this work, we propose a multi-modal convolutional neural network (CNN) architecture that labels endoscopic ultrasound (EUS) images from raw verbal comments provided by a clinician during the procedure. We use a CNN composed of two branches, one for voice data and another for image data, which are joined to predict image labels from the spoken names of anatomical landmarks. The network was trained using recorded verbal comments from expert operators. Our results show a prediction accuracy of 76% at image level on a dataset with 5 different labels. We conclude that the addition of spoken commentaries can increase the performance of ultrasound image classification, and eliminate the burden of manually labelling large EUS datasets necessary for deep learning applications.
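The two-branch fusion described above can be sketched in miniature. The snippet below is a minimal NumPy illustration, not the authors' architecture: each CNN branch is stood in for by a single dense layer, the feature dimensions (64 for image, 32 for voice) and fusion-by-concatenation are assumptions chosen for clarity, and only the 5-way output matches the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch(x, w, b):
    # Dense layer + ReLU: a stand-in for one CNN branch's feature extractor.
    return np.maximum(x @ w + b, 0.0)

# Hypothetical embedding sizes for the two modalities.
img_feat = rng.standard_normal(64)    # image-branch input
voice_feat = rng.standard_normal(32)  # voice-branch input

# Randomly initialised weights (training is omitted in this sketch).
w_img, b_img = rng.standard_normal((64, 16)), np.zeros(16)
w_voice, b_voice = rng.standard_normal((32, 16)), np.zeros(16)
w_out, b_out = rng.standard_normal((32, 5)), np.zeros(5)  # 5 landmark labels

# Join the branches by concatenation, then classify with a softmax head.
fused = np.concatenate([branch(img_feat, w_img, b_img),
                        branch(voice_feat, w_voice, b_voice)])
logits = fused @ w_out + b_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_label = int(np.argmax(probs))
```

In practice each branch would be a trained convolutional stack over spectrograms (voice) and B-mode frames (image); the sketch only shows the late-fusion pattern the abstract describes.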