用于膝关节骨关节炎中基于MRI评分的高效全器官报告和分类的大语言模型

Large language models for efficient whole-organ MRI score-based reports and categorization in knee osteoarthritis.

作者信息

Xie Yuxue, Hu Zhonghua, Tao Hongyue, Hu Yiwen, Liang Haoyu, Lu Xinmin, Wang Lei, Li Xiangwen, Chen Shuang

机构信息

Department of Radiology & Institute of Medical Functional and Molecular Imaging, Huashan Hospital, Fudan University, Shanghai, China.

Digital & Automation, Siemens Shanghai Medical Equipment Ltd., Shanghai, China.

出版信息

Insights Imaging. 2025 May 14;16(1):100. doi: 10.1186/s13244-025-01976-w.

DOI:10.1186/s13244-025-01976-w

PMID:40366500

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12078906/

Abstract

OBJECTIVES

To evaluate the performance of large language models (LLMs) in automatically generating whole-organ MRI score (WORMS)-based structured MRI reports and predicting osteoarthritis (OA) severity for the knee.

METHODS

A total of 160 consecutive patients suspected of OA were included. Knee MRI reports were reviewed by three radiologists to establish the WORMS reference standard for 39 key features. GPT-4o and GPT-4o-mini were prompted using in-context knowledge (ICK) and chain-of-thought (COT) to generate WORMS-based structured reports from original reports and to automatically predict the OA severity. Four Orthopedic surgeons reviewed original and LLM-generated reports to conduct pairwise preference and difficulty tests, and their review times were recorded.

RESULTS

GPT-4o demonstrated perfect performance in extracting the laterality of the knee (accuracy = 100%). GPT-4o outperformed GPT-4o mini in generating WORMS reports (Accuracy: 93.9% vs 76.2%, respectively). GPT-4o achieved higher recall (87.3% s 46.7%, p < 0.001), while maintaining higher precision compared to GPT-4o mini (94.2% vs 71.2%, p < 0.001). For predicting OA severity, GPT-4o outperformed GPT-4o mini across all prompt strategies (best accuracy: 98.1% vs 68.7%). Surgeons found it easier to extract information and gave more preference to LLM-generated reports over the original reports (both p < 0.001) while spending less time on each report (51.27 ± 9.41 vs 87.42 ± 20.26 s, p < 0.001).

CONCLUSION

GPT-4o generated expert multi-feature, WORMS-based reports from original free-text knee MRI reports. GPT-4o with COT achieved high accuracy in categorizing OA severity. Surgeons reported greater preference and higher efficiency when using LLM-generated reports.

CRITICAL RELEVANCE STATEMENT

The perfect performance of generating WORMS-based reports and the high efficiency and ease of use suggest that integrating LLMs into clinical workflows could greatly enhance productivity and alleviate the documentation burden faced by clinicians in knee OA.

KEY POINTS

GPT-4o successfully generated WORMS-based knee MRI reports. GPT-4o with COT prompting achieved impressive accuracy in categorizing knee OA severity. Greater preference and higher efficiency were reported for LLM-generated reports.

摘要

目的

评估大语言模型（LLMs）在自动生成基于全器官MRI评分（WORMS）的结构化MRI报告以及预测膝关节骨关节炎（OA）严重程度方面的性能。

方法

共纳入160例连续怀疑患有OA的患者。三名放射科医生对膝关节MRI报告进行审查，以建立39个关键特征的WORMS参考标准。使用上下文知识（ICK）和思维链（COT）提示GPT-4o和GPT-4o-mini，从原始报告生成基于WORMS的结构化报告，并自动预测OA严重程度。四名骨科医生审查原始报告和LLM生成的报告，进行成对偏好和难度测试，并记录他们的审查时间。

结果

GPT-4o在提取膝关节侧别方面表现完美（准确率 = 100%）。在生成WORMS报告方面，GPT-4o优于GPT-4o-mini（准确率分别为93.9%和76.2%）。GPT-4o实现了更高的召回率（87.3%对46.7%，p < 0.001），同时与GPT-4o-mini相比保持了更高的精度（94.2%对71.2%，p < 0.001）。在预测OA严重程度方面，GPT-4o在所有提示策略上均优于GPT-4o-mini（最佳准确率：98.1%对68.7%）。外科医生发现从LLM生成的报告中提取信息更容易，并且比原始报告更倾向于LLM生成的报告（两者p < 0.001），同时在每份报告上花费的时间更少（51.27 ± 9.41秒对87.42 ± 20.26秒，p < 0.001）。