From the Department of Radiology, Herlev and Gentofte Hospital, Borgmester Ib Juuls Vej 1, Herlev, Copenhagen 2730, Denmark (L.L.P., F.C.M., M.W.B., C.H.K., L.C.L., M.B.A.); Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark (L.L.P., M.W.B., C.H.K., M.B., M.B.A.); Radiological Artificial Intelligence Testcenter, RAIT.dk, Herlev, Denmark (L.L.P., F.C.M., M.W.B., C.H.K., M.B., M.B.A.); Department of Radiology, Bispebjerg and Frederiksberg Hospital, Copenhagen, Denmark (M.W.B., M.B.); Department of Radiology, Aarhus University Hospital, Aarhus, Denmark (F.R.); and Department of Cardiology, Bispebjerg and Frederiksberg Hospital, Copenhagen, Denmark (O.W.N.).
Radiology. 2024 Aug;312(2):e240272. doi: 10.1148/radiol.240272.
Background Radiology practices have a high volume of unremarkable chest radiographs and artificial intelligence (AI) could possibly improve workflow by providing an automatic report. Purpose To estimate the proportion of unremarkable chest radiographs, where AI can correctly exclude pathology (ie, specificity) without increasing diagnostic errors. Materials and Methods In this retrospective study, consecutive chest radiographs in unique adult patients (≥18 years of age) were obtained January 1-12, 2020, at four Danish hospitals. Exclusion criteria included insufficient radiology reports or AI output error. Two thoracic radiologists, who were blinded to AI output, labeled chest radiographs as "remarkable" or "unremarkable" based on predefined unremarkable findings (reference standard). Radiology reports were classified similarly. A commercial AI tool was adapted to output a chest radiograph "remarkableness" probability, which was used to calculate specificity at different AI sensitivities. Chest radiographs with missed findings by AI and/or the radiology report were graded by one thoracic radiologist as critical, clinically significant, or clinically insignificant. Paired proportions were compared using the McNemar test. Results A total of 1961 patients were included (median age, 72 years [IQR, 58-81 years]; 993 female), with one chest radiograph per patient. The reference standard labeled 1231 of 1961 chest radiographs (62.8%) as remarkable and 730 of 1961 (37.2%) as unremarkable. At 99.9%, 99.0%, and 98.0% sensitivity, the AI had a specificity of 24.5% (179 of 730 radiographs [95% CI: 21, 28]), 47.1% (344 of 730 radiographs [95% CI: 43, 51]), and 52.7% (385 of 730 radiographs [95% CI: 49, 56]), respectively. 
With the AI fixed to have a similar sensitivity as radiology reports (87.2%), the missed findings of AI and reports had 2.2% (27 of 1231 radiographs) and 1.1% (14 of 1231 radiographs) classified as critical (P = .01), 4.1% (51 of 1231 radiographs) and 3.6% (44 of 1231 radiographs) classified as clinically significant (P = .46), and 6.5% (80 of 1231) and 8.1% (100 of 1231) classified as clinically insignificant (P = .11), respectively. At sensitivities greater than or equal to 95.4%, the AI tool exhibited less than or equal to 1.1% critical misses. Conclusion A commercial AI tool used off-label could correctly exclude pathology in 24.5%-52.7% of all unremarkable chest radiographs at greater than or equal to 98% sensitivity. The AI had equal or lower rates of critical misses than radiology reports at sensitivities greater than or equal to 95.4%. These results should be confirmed in a prospective study. © RSNA, 2024 See also the editorial by Yoon and Hwang in this issue.
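The two analysis steps named in Materials and Methods, reading specificity off at a fixed sensitivity and comparing paired proportions with the McNemar test, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the study's pipeline: the function names `specificity_at_sensitivity` and `mcnemar_exact_p`, the toy scores, and the choice of the exact (binomial) form of the McNemar test are all hypothetical, since the abstract does not say which McNemar variant or confidence-interval method was used. Only the counts in the final loop are taken from the abstract.

```python
import math

def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Pick the highest threshold whose sensitivity meets the target.

    scores: "remarkableness" probabilities (hypothetical AI output).
    labels: 1 = remarkable, 0 = unremarkable (reference standard).
    A radiograph is flagged remarkable when score >= threshold.
    Returns (threshold, sensitivity, specificity).
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one remarkable and one unremarkable case")
    # Smallest number of remarkable cases that must be flagged;
    # the small epsilon guards against floating-point round-up.
    k = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[k - 1]  # k-th highest score among remarkable cases
    sensitivity = sum(s >= threshold for s in pos) / len(pos)
    specificity = sum(s < threshold for s in neg) / len(neg)
    return threshold, sensitivity, specificity

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    b: pairs where only method A missed; c: pairs where only method B missed.
    Assumption: exact binomial form; the abstract does not specify the variant.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data (not from the study): four remarkable, five unremarkable cases.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
print(specificity_at_sensitivity(scores, labels, 1.0))

# Counts reported in the abstract: unremarkable radiographs correctly excluded.
for excluded in (179, 344, 385):
    print(f"{excluded}/730 = {100 * excluded / 730:.1f}%")
```

The final loop reproduces the reported specificities of 24.5%, 47.1%, and 52.7% from the stated counts. The McNemar function needs the discordant-pair counts, which the abstract does not report, so it is shown without study data.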