Terzis Robert, Kaya Kenan, Schömig Thomas, Janssen Jan Paul, Iuga Andra-Iza, Kottlors Jonathan, Lennartz Simon, Gietzen Carsten, Gözdas Cansin, Müller Lukas, Hahnfeldt Robert, Maintz David, Dratsch Thomas, Pennig Lenhard
Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg-University, Mainz, Germany.
Eur Radiol. 2025 Aug 8. doi: 10.1007/s00330-025-11888-4.
This study evaluated GPT-4's accuracy in MRI sequence selection based on radiology request forms (RRFs), comparing its performance to radiology residents.
This retrospective study included 100 RRFs across four subspecialties (cardiac imaging, neuroradiology, musculoskeletal imaging, and oncology). GPT-4 and two radiology residents (R1: 2 years and R2: 5 years of MRI experience) selected sequences based on each patient's medical history and clinical questions. With reference to imaging society guidelines, five board-certified specialized radiologists rated the protocols in consensus for completeness, quality, and utility on 5-point Likert scales. Clinical applicability was rated as a binary outcome by the institution's lead radiographer.
GPT-4 achieved median scores of 3 (1-5) for completeness, 4 (1-5) for quality, and 4 (1-5) for utility, comparable to R1 (3 (1-5), 4 (1-5), and 4 (1-5); each p > 0.05) but inferior to R2 (completeness 4 (1-5) and quality 5 (1-5), each p < 0.01; utility 5 (1-5), p < 0.001). Protocol quality varied by subspecialty. In cardiac imaging, GPT-4 did not differ significantly from R1 (4 (2-4) vs. 4 (2-5), p = 0.20) or R2 (4 (2-5); p = 0.47). In neuroradiology, there were no differences (all 5 (1-5), p > 0.05). In musculoskeletal imaging, GPT-4 scored lower than both R1 (3 (2-5) vs. 4 (3-5); p < 0.01) and R2 (5 (3-5); p < 0.001). In oncology, GPT-4 did not differ significantly from R1 (4 (1-5) vs. 2 (1-4), p = 0.12) or R2 (5 (2-5); p = 0.20). GPT-4-based protocols were clinically applicable in 95% of cases, comparable to R1 (95%) and R2 (96%).
GPT-4 generated MRI protocols with notable completeness, quality, utility, and clinical applicability, excelling in standardized subspecialties such as cardiac imaging and neuroradiology while yielding lower accuracy in musculoskeletal examinations.
Question: Long MRI acquisition times limit patient access, making accurate protocol selection crucial for efficient diagnostics, though it is time-consuming and error-prone, especially for inexperienced residents.
Findings: GPT-4 generated MRI protocols of remarkable yet inconsistent quality, performing on par with an experienced resident in standardized fields, but moderately in musculoskeletal examinations.
Clinical relevance: The large language model can assist less experienced radiologists in determining detailed MRI protocols and counteract increasing workloads. The model could function as a semi-automatic tool, generating MRI protocols for radiologists' confirmation, optimizing resource allocation, and improving diagnostics and cost-effectiveness.