Yu Alice C, Mohajer Bahram, Eng John
Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, 1800 Orleans St, Baltimore, MD 21287.
Radiol Artif Intell. 2022 May 4;4(3):e210064. doi: 10.1148/ryai.210064. eCollection 2022 May.
To assess generalizability of published deep learning (DL) algorithms for radiologic diagnosis.
In this systematic review, the PubMed database was searched for peer-reviewed studies of DL algorithms for image-based radiologic diagnosis that included external validation, published from January 1, 2015, through April 1, 2021. Studies using nonimaging features or incorporating non-DL methods for feature extraction or classification were excluded. Two reviewers independently evaluated studies for inclusion, and any discrepancies were resolved by consensus. Internal and external performance measures and pertinent study characteristics were extracted, and relationships among these data were examined using nonparametric statistics.
Eighty-three studies reporting 86 algorithms were included. The vast majority (70 of 86, 81%) reported at least some decrease in external performance compared with internal performance, with nearly half (42 of 86, 49%) reporting at least a modest decrease (≥0.05 on the unit scale) and nearly a quarter (21 of 86, 24%) reporting a substantial decrease (≥0.10 on the unit scale). No study characteristics were found to be associated with the difference between internal and external performance.
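The thresholds above can be made concrete with a short sketch. It assumes each algorithm contributes one internal and one external performance value on the unit scale (for example, AUC) and tallies the cumulative categories used in the results. The function names and the sample pairs are hypothetical illustrations, not the authors' code or data; only the 0.05 and 0.10 cutoffs and the reported counts come from the abstract.

```python
from typing import Dict, List, Tuple

MODEST = 0.05       # "at least a modest decrease" cutoff from the abstract
SUBSTANTIAL = 0.10  # "substantial decrease" cutoff from the abstract


def pct(count: int, total: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


def summarize(pairs: List[Tuple[float, float]]) -> Dict[str, Tuple[int, int]]:
    """Tally cumulative decrease categories for (internal, external) pairs.

    The categories are nested, as in the abstract: a substantial decrease
    also counts as a modest decrease and as any decrease.
    """
    # Round the difference to avoid floating-point artifacts such as
    # 0.95 - 0.90 evaluating to slightly less than 0.05.
    drops = [round(internal - external, 6) for internal, external in pairs]
    n = len(drops)
    counts = {
        "any_decrease": sum(d > 0 for d in drops),
        "modest_or_more": sum(d >= MODEST for d in drops),
        "substantial": sum(d >= SUBSTANTIAL for d in drops),
    }
    return {name: (c, pct(c, n)) for name, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical (internal, external) values, for illustration only.
    example = [(0.95, 0.90), (0.90, 0.80), (0.88, 0.87), (0.85, 0.86)]
    print(summarize(example))

    # Check of the percentages reported in the abstract.
    for count in (70, 42, 21):
        print(f"{count} of 86 = {pct(count, 86)}%")
```

Rounding the difference before comparing it with a cutoff is a design choice made here so that values lying exactly on a threshold are classified consistently; the abstract does not say how such boundary cases were handled.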
Among published external validation studies of DL algorithms for image-based radiologic diagnosis, the vast majority demonstrated diminished algorithm performance on the external dataset, with some reporting a substantial performance decrease.
Keywords: Meta-Analysis, Computer Applications-Detection/Diagnosis, Neural Networks, Computer Applications-General (Informatics), Epidemiology, Technology Assessment, Diagnosis, Informatics.
© RSNA, 2022.