Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, Sendai, Japan.
National Institute of Technology, Sendai College, Sendai, Japan.
J Imaging Inform Med. 2024 Jun;37(3):1-10. doi: 10.1007/s10278-024-00974-6. Epub 2024 Feb 9.
Diagnosing drowning at autopsy is a complicated process, even with the assistance of autopsy imaging and on-site information from where the body was found. Previous studies have developed high-performing deep learning (DL) models for drowning diagnosis. However, the validity of those DL models was not assessed, raising doubts about whether the learned features accurately represent the medical findings observed by human experts. In this paper, we assessed the medical validity of DL models that had achieved high classification performance for drowning diagnosis. This retrospective study included autopsy cases aged 8-91 years who underwent postmortem computed tomography between 2012 and 2021 (153 drowning and 160 non-drowning cases). We first trained three DL models from a previous work and generated saliency maps that highlight the features of the input most important to each prediction. To assess the validity of the models, four radiological technologists created pixel-level annotations, which were then quantitatively compared with the saliency maps. All three models demonstrated high classification performance, with areas under the receiver operating characteristic curve of 0.94, 0.97, and 0.98, respectively. However, the assessment revealed unexpected inconsistency between the annotations and the models' saliency maps. Irrelevant areas accounted for around 30%, 40%, and 80% of the saliency maps of the three models, respectively, suggesting that the predictions of the DL models might be unreliable. These results highlight the need for careful assessment of DL tools, even those with high classification performance.
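The abstract does not state which metric was used to compare the saliency maps with the annotations. As an illustration only, the following minimal sketch shows one plausible way to quantify the share of a saliency map that falls outside an expert annotation. The function name `irrelevant_fraction`, the min-max normalization, and the 0.5 threshold are all assumptions made for this example, not the authors' method.

```python
import numpy as np


def irrelevant_fraction(saliency, annotation, threshold=0.5):
    """Return the share of salient pixels lying outside the annotation.

    Hypothetical helper: the normalization and threshold are
    illustrative assumptions, not taken from the study.
    """
    saliency = np.asarray(saliency, dtype=float)
    annotation = np.asarray(annotation, dtype=bool)
    if saliency.shape != annotation.shape:
        raise ValueError("saliency and annotation must have the same shape")

    lo, hi = saliency.min(), saliency.max()
    if hi == lo:
        # A constant map defines no salient region; report 0.0 by convention.
        return 0.0

    # Min-max normalize to [0, 1], then binarize at the chosen threshold.
    salient = (saliency - lo) / (hi - lo) >= threshold
    n_salient = salient.sum()
    if n_salient == 0:
        return 0.0

    # Salient pixels that the annotators did not mark as relevant.
    outside = np.logical_and(salient, np.logical_not(annotation)).sum()
    return float(outside / n_salient)


# Toy 4x4 example: a 2x2 salient block, half of which is annotated.
toy_saliency = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])
toy_annotation = np.zeros((4, 4), dtype=bool)
toy_annotation[1:3, 1] = True

toy_result = irrelevant_fraction(toy_saliency, toy_annotation)
```

Under these assumptions, a value near 0 would mean that the model attends almost exclusively to annotated regions, while a value near 0.8 would correspond to the worst case reported above. The result depends on the threshold, so any real comparison would need to report how it was chosen.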