Kim Songsoo, Kim Donghyun, Shin Hyun Joo, Lee Seung Hyun, Kang Yeseul, Jeong Sejin, Kim Jaewoong, Han Miran, Lee Seong-Joon, Kim Joonho, Yum Jungyon, Han Changho, Yoon Dukyong
From the Departments of Biomedical Systems Informatics (S.K., Jaewoong Kim, C.H., D.Y.) and Neurology (Joonho Kim, J.Y.), Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea; Department of Radiology, Central Draft Physical Examination Office of Military Manpower Administration, Daegu, Republic of Korea (D.K.); Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science (H.J.S., Y.K., S.J.), and Center for Digital Health (H.J.S., D.Y.), Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea; Department of Radiology, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea (S.H.L.); Departments of Radiology (M.H.) and Neurology (S.J.L.), Ajou University Hospital, Ajou University School of Medicine, Suwon, Republic of Korea; and Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea (D.Y.).
Radiology. 2025 Jan;314(1):e240701. doi: 10.1148/radiol.240701.
Background The increasing workload of radiologists can lead to burnout and errors in radiology reports. Large language models, such as OpenAI's GPT-4, hold promise as error revision tools for radiology. Purpose To test the feasibility of GPT-4 use by determining its error detection, reasoning, and revision performance on head CT reports with varying error types and to validate its clinical utility by comparison with human readers. Materials and Methods A total of 10 300 head CT reports were retrospectively extracted from the Medical Information Mart for Intensive Care III public dataset. In experiment 1, of 300 unaltered reports and 300 versions with intentionally introduced errors, 200 reports were first used for GPT-4 optimization. The remaining 400 were used to evaluate error type detection, reasoning, and revision, as well as to analyze reports with undetected errors. The performance was also compared with that of human readers. In experiment 2, the detection performance of GPT-4 was validated on 10 000 unaltered reports that were deemed error-free by physicians, and an analysis of false-positive results was conducted. A permutation test was conducted to assess differences in performance. Results GPT-4 demonstrated commendable performance in error detection (sensitivity, 84% for interpretive errors and 89% for factual errors), reasoning, and revision. Compared with GPT-4, human readers had worse factual error detection sensitivity (0.33-0.69 vs 0.89; P = .008 for radiologist 4, P < .001 for others) and took longer to review (82-121 seconds vs 16 seconds; P < .001). In 10 000 reports, GPT-4 detected 96 errors, with a low positive predictive value of 0.05, yet 14% of the false-positive responses were potentially beneficial. Conclusion GPT-4 effectively detects, reasons, and revises errors in radiology reports. While it shows excellent performance in identifying factual errors, its ability to prioritize clinically significant findings is limited.
Recognizing its strengths and limitations, GPT-4 could serve as a feasible tool. © RSNA, 2025 See also the editorial by Choi in this issue.
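The abstract reports only derived metrics (sensitivity and positive predictive value), not the underlying confusion-matrix counts. As a minimal sketch of how those two metrics relate, the Python below computes them from hypothetical counts chosen solely to reproduce the reported values of 0.89 and 0.05; these counts are illustrative assumptions, not data from the study.

```python
# Illustrative only: the confusion-matrix counts below are hypothetical
# and are NOT taken from the study, which reports only derived metrics.

def sensitivity(tp: int, fn: int) -> float:
    """Share of true errors that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)

def ppv(tp: int, fp: int) -> float:
    """Share of flags that were true errors: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical counts matching the reported factual-error sensitivity
# (0.89) and the low positive predictive value (0.05).
print(sensitivity(tp=89, fn=11))       # 0.89
print(round(ppv(tp=5, fp=95), 2))      # 0.05
```

The contrast between the two values captures the abstract's conclusion: a detector can catch most real errors (high sensitivity) while still producing many false alarms on error-free reports (low PPV).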