Büyüktoka Raşit Eren, Surucu Murat, Erekli Derinkaya Pelin Berfin, Adibelli Zehra Hilal, Salbas Ali, Koc Ali Murat, Buyuktoka Asli Dilara, Isler Yalcın, Ugur Mehmet Alperen, Isiklar Elif
Department of Radiology, Izmir Foça State Hospital, Izmir, Türkiye.
Bucak Computer and Informatics Faculty, Burdur Mehmet Akif Ersoy University, Burdur, Türkiye.
Eur Radiol. 2025 Aug 20. doi: 10.1007/s00330-025-11933-2.
To create and test a locally adapted large language model (LLM) for automated scoring of radiology requisitions based on the Reason for exam Imaging Reporting and Data System (RI-RADS), and to evaluate its performance against a reference standard.
This retrospective, two-center study included 131,683 radiology requisitions from two institutions. A Bidirectional Encoder Representations from Transformers (BERT)-based model was trained on 101,563 requisitions from Center 1 (including 1,500 synthetic examples) and externally tested on 18,887 requisitions from Center 2. The model's performance under two different classification strategies was evaluated against a reference standard created by three radiologists. Performance was assessed using Cohen's kappa, accuracy, F1-score, sensitivity, and specificity, with 95% confidence intervals.
A total of 18,887 requisitions were evaluated in the external test set. External testing yielded an F1-score of 0.93 (95% CI: 0.912-0.943) and κ = 0.88 (95% CI: 0.871-0.884). Performance was highest for the common categories RI-RADS D and X (F1 ≥ 0.96) and lowest for the rare categories RI-RADS A and B (F1 ≤ 0.49). When grades were grouped into three categories (adequate, inadequate, and unacceptable), overall model performance improved (F1-score = 0.97; 95% CI: 0.96-0.97).
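To make the reported metrics concrete, the sketch below computes the two headline measures of the study, Cohen's kappa (chance-corrected agreement) and per-class F1, on a small toy labeling task. The data and grade labels here are placeholders for illustration only, not the study's requisitions or results.

```python
# Toy illustration (not the paper's data): Cohen's kappa and per-class F1
# for a multi-class labeling task, using RI-RADS grade letters as
# placeholder class names.
from collections import Counter


def cohens_kappa(ref, pred):
    """Chance-corrected agreement between reference labels and predictions."""
    n = len(ref)
    observed = sum(r == p for r, p in zip(ref, pred)) / n
    ref_counts, pred_counts = Counter(ref), Counter(pred)
    # Expected agreement if both raters labeled at random with these marginals.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    return (observed - expected) / (1 - expected)


def f1_per_class(ref, pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(r == label and p == label for r, p in zip(ref, pred))
    fp = sum(p == label and r != label for r, p in zip(ref, pred))
    fn = sum(r == label and p != label for r, p in zip(ref, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference-standard labels vs. model predictions.
ref = ["D", "D", "X", "C", "D", "X", "B", "D", "C", "X"]
pred = ["D", "D", "X", "C", "D", "X", "C", "D", "D", "X"]

print("kappa:", round(cohens_kappa(ref, pred), 3))
for label in ["B", "C", "D", "X"]:
    print(label, round(f1_per_class(ref, pred, label), 2))
```

As in the study, F1 is highest for the well-represented classes and drops for rare ones, which is why the paper reports per-category F1 alongside the overall kappa.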
The locally adapted BERT-based model demonstrated high performance and almost perfect agreement with radiologists in automated RI-RADS scoring, showing promise for integration into radiology workflows to improve requisition completeness and communication.
Question Can an LLM accurately and automatically score radiology requisitions based on standardized criteria to address the challenges of incomplete information in radiological practice? Findings A locally adapted BERT-based model demonstrated high performance (F1-score 0.93) and almost perfect agreement with radiologists in automated RI-RADS scoring across a large, multi-institutional dataset. Clinical relevance LLMs offer a scalable solution for automated scoring of radiology requisitions, with the potential to improve workflow in radiology. Further improvement and integration into clinical practice could enhance communication, contributing to better diagnoses and patient care.