Xie Yuxue, Hu Zhonghua, Tao Hongyue, Hu Yiwen, Liang Haoyu, Lu Xinmin, Wang Lei, Li Xiangwen, Chen Shuang
Department of Radiology & Institute of Medical Functional and Molecular Imaging, Huashan Hospital, Fudan University, Shanghai, China.
Digital & Automation, Siemens Shanghai Medical Equipment Ltd., Shanghai, China.
Insights Imaging. 2025 May 14;16(1):100. doi: 10.1186/s13244-025-01976-w.
To evaluate the performance of large language models (LLMs) in automatically generating whole-organ MRI score (WORMS)-based structured MRI reports and predicting osteoarthritis (OA) severity for the knee.
A total of 160 consecutive patients with suspected OA were included. Three radiologists reviewed the knee MRI reports to establish the WORMS reference standard for 39 key features. GPT-4o and GPT-4o-mini were prompted using in-context knowledge (ICK) and chain-of-thought (COT) to generate WORMS-based structured reports from the original reports and to automatically predict OA severity. Four orthopedic surgeons reviewed the original and LLM-generated reports in pairwise preference and difficulty tests, and their review times were recorded.
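As a rough illustration of how ICK and COT prompting can be combined, the sketch below assembles a single prompt string from a block of reference knowledge, an optional step-by-step reasoning instruction, and the original report. The function name `build_prompt`, its parameters, and all prompt wording are hypothetical; the study's actual prompts are not given in this abstract, and no model API is called here.

```python
def build_prompt(report_text: str, ick_notes: str, use_cot: bool = True) -> str:
    """Assemble one prompt string from hypothetical building blocks.

    ick_notes stands in for in-context knowledge (for example, scoring
    definitions supplied by the user); its content is an assumption and
    is passed through unchanged.
    """
    parts = [
        "Convert the free-text knee MRI report into a WORMS-based structured report.",
        "Reference knowledge:\n" + ick_notes.strip(),
    ]
    if use_cot:
        # Chain-of-thought: ask the model to reason before answering.
        parts.append(
            "Reason step by step through each feature before giving the final report."
        )
    parts.append("Original report:\n" + report_text.strip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_prompt(
        report_text="Left knee. Medial meniscus tear.",  # made-up example text
        ick_notes="(scoring definitions would go here)",
    )
    print(demo)
```

Keeping the reference knowledge and the reasoning instruction as separate pieces makes it straightforward to compare prompt strategies (ICK only, COT only, or both) on the same report.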
GPT-4o extracted knee laterality with perfect accuracy (100%). GPT-4o outperformed GPT-4o-mini in generating WORMS reports (accuracy: 93.9% vs 76.2%). GPT-4o achieved higher recall (87.3% vs 46.7%, p < 0.001) while maintaining higher precision than GPT-4o-mini (94.2% vs 71.2%, p < 0.001). For predicting OA severity, GPT-4o outperformed GPT-4o-mini across all prompt strategies (best accuracy: 98.1% vs 68.7%). Surgeons found it easier to extract information from LLM-generated reports and preferred them over the original reports (both p < 0.001), while spending less time on each report (51.27 ± 9.41 vs 87.42 ± 20.26 s, p < 0.001).
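For readers less familiar with the metrics reported above, the following minimal sketch computes accuracy, precision, and recall for binary feature labels (1 = feature present, 0 = absent). The function name `binary_metrics` and the example labels are made up for illustration. The abstract does not say how the study aggregated its 39 features or handled multi-grade scores, so this is not a reproduction of its evaluation.

```python
def binary_metrics(reference, predicted):
    """Return (accuracy, precision, recall) for two equal-length 0/1 label lists."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("label lists must not be empty")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # Made-up labels for eight features of a single hypothetical report.
    reference = [1, 1, 1, 0, 0, 1, 0, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(reference, predicted))
```

Recall falls when the model omits findings that the reference standard contains, and precision falls when it reports findings that are not there, which is why the two are reported separately.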
GPT-4o generated expert-level, multi-feature, WORMS-based reports from original free-text knee MRI reports. GPT-4o with COT achieved high accuracy in categorizing OA severity. Surgeons reported greater preference and higher efficiency when using LLM-generated reports.
The perfect performance in generating WORMS-based reports, together with their high efficiency and ease of use, suggests that integrating LLMs into clinical workflows could greatly enhance productivity and alleviate the documentation burden faced by clinicians managing knee OA.
GPT-4o successfully generated WORMS-based knee MRI reports. GPT-4o with COT prompting achieved impressive accuracy in categorizing knee OA severity. Greater preference and higher efficiency were reported for LLM-generated reports.