Large language models for efficient whole-organ MRI score-based reports and categorization in knee osteoarthritis.

Author information

Xie Yuxue, Hu Zhonghua, Tao Hongyue, Hu Yiwen, Liang Haoyu, Lu Xinmin, Wang Lei, Li Xiangwen, Chen Shuang

Affiliations

Department of Radiology & Institute of Medical Functional and Molecular Imaging, Huashan Hospital, Fudan University, Shanghai, China.

Digital & Automation, Siemens Shanghai Medical Equipment Ltd., Shanghai, China.

Publication information

Insights Imaging. 2025 May 14;16(1):100. doi: 10.1186/s13244-025-01976-w.

Abstract

OBJECTIVES

To evaluate the performance of large language models (LLMs) in automatically generating whole-organ MRI score (WORMS)-based structured MRI reports and predicting osteoarthritis (OA) severity for the knee.

METHODS

A total of 160 consecutive patients with suspected OA were included. Three radiologists reviewed the knee MRI reports to establish the WORMS reference standard for 39 key features. GPT-4o and GPT-4o-mini were prompted using in-context knowledge (ICK) and chain-of-thought (COT) to generate WORMS-based structured reports from the original reports and to automatically predict OA severity. Four orthopedic surgeons reviewed the original and LLM-generated reports to conduct pairwise preference and difficulty tests, and their review times were recorded.
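The abstract does not reproduce the prompts used in the study, so the following is only a minimal sketch of how a prompt combining in-context knowledge with an optional chain-of-thought instruction might be assembled. The function name `build_prompt`, its parameters, the instruction wording, and the example feature names are all hypothetical and are not taken from the paper. No model API is called.

```python
def build_prompt(report_text, feature_names, knowledge, use_cot):
    """Assemble a plain-text prompt for structured report generation.

    All wording here is illustrative; it is NOT the prompt from the study.
    """
    lines = ["You are assisting with structured knee MRI reporting."]

    # In-context knowledge (ICK): background definitions supplied in the prompt.
    if knowledge:
        lines.append("Reference knowledge:")
        lines.append(knowledge)

    # The features the structured report should cover (hypothetical names).
    lines.append("Features to fill in:")
    lines.extend(f"- {name}" for name in feature_names)

    # Chain-of-thought (COT): ask for intermediate reasoning before the answer.
    if use_cot:
        lines.append("Reason step by step before giving the final structured report.")

    lines.append("Original report:")
    lines.append(report_text)
    return "\n".join(lines)


if __name__ == "__main__":
    example = build_prompt(
        report_text="(free-text knee MRI report goes here)",
        feature_names=["laterality", "cartilage", "osteophytes"],
        knowledge="(scoring definitions go here)",
        use_cot=True,
    )
    print(example)
```

Keeping the knowledge block and the reasoning instruction as separate switches makes it straightforward to compare prompt strategies (with or without ICK, with or without COT) on the same set of reports.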

RESULTS

GPT-4o demonstrated perfect performance in extracting the laterality of the knee (accuracy = 100%). GPT-4o outperformed GPT-4o-mini in generating WORMS reports (accuracy: 93.9% vs 76.2%, respectively). GPT-4o achieved higher recall (87.3% vs 46.7%, p < 0.001) while maintaining higher precision than GPT-4o-mini (94.2% vs 71.2%, p < 0.001). For predicting OA severity, GPT-4o outperformed GPT-4o-mini across all prompt strategies (best accuracy: 98.1% vs 68.7%). Surgeons found it easier to extract information from LLM-generated reports and preferred them over the original reports (both p < 0.001), while spending less time on each report (51.27 ± 9.41 vs 87.42 ± 20.26 s, p < 0.001).
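For readers less familiar with the metrics above, the following sketch shows how accuracy, precision, and recall are conventionally computed when each extracted feature is compared against a reference label. It assumes binary (present/absent) labels and uses toy data; the abstract does not state how the study aggregated its 39 features, so this is an illustration of the standard definitions rather than the authors' evaluation code. All function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # feature present in reference and in model output
    fp: int = 0  # absent in reference, reported by model
    fn: int = 0  # present in reference, missed by model
    tn: int = 0  # absent in both


def count_outcomes(reference, predicted):
    """Tally confusion-matrix cells for paired boolean labels."""
    c = Counts()
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            c.tp += 1
        elif pred:
            c.fp += 1
        elif ref:
            c.fn += 1
        else:
            c.tn += 1
    return c


def precision(c):
    # Share of model-reported findings that are correct.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    # Share of reference findings that the model recovered.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def accuracy(c):
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


if __name__ == "__main__":
    # Toy labels, not study data.
    reference = [True, True, True, False, False, False, True, False]
    predicted = [True, True, False, False, False, True, True, False]
    c = count_outcomes(reference, predicted)
    print(precision(c), recall(c), accuracy(c))  # 0.75 0.75 0.75
```

A low recall paired with moderate precision, as reported for GPT-4o-mini, corresponds to many missed findings (a large `fn` count) relative to spurious ones.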

CONCLUSION

GPT-4o generated expert multi-feature, WORMS-based reports from original free-text knee MRI reports. GPT-4o with COT achieved high accuracy in categorizing OA severity. Surgeons reported greater preference and higher efficiency when using LLM-generated reports.

CRITICAL RELEVANCE STATEMENT

The perfect performance of generating WORMS-based reports and the high efficiency and ease of use suggest that integrating LLMs into clinical workflows could greatly enhance productivity and alleviate the documentation burden faced by clinicians in knee OA.

KEY POINTS

GPT-4o successfully generated WORMS-based knee MRI reports. GPT-4o with COT prompting achieved impressive accuracy in categorizing knee OA severity. Greater preference and higher efficiency were reported for LLM-generated reports.
