Automated cardiac magnetic resonance interpretation derived from prompted large language models.

Authors

Wang Lujing, Peng Liang, Wan Yixuan, Li Xingyu, Chen Yixin, Wang Li, Gong Xiuxian, Zhao Xiaoying, Yu Lequan, Zhao Shihua, Zhao Xinxiang

Affiliations

Department of Radiology, The Second Affiliated Hospital of Kunming Medical University, Kunming, China.

Department of Statistics and Actuarial Science, School of Computing and Data Science, The University of Hong Kong, Hong Kong, China.

Publication information

Cardiovasc Diagn Ther. 2025 Aug 30;15(4):726-737. doi: 10.21037/cdt-2025-112. Epub 2025 Aug 28.

DOI:10.21037/cdt-2025-112
PMID:40948711
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12432601/
Abstract

BACKGROUND

The versatility of cardiac magnetic resonance (CMR) leads to complex and time-consuming interpretation. Large language models (LLMs) present transformative potential for automated CMR interpretations. We explored the ability of LLMs in the automated classification and diagnosis of CMR reports for three common cardiac diseases: myocardial infarction (MI), dilated cardiomyopathy (DCM), and hypertrophic cardiomyopathy (HCM).

METHODS

This retrospective study included CMR reports from consecutive patients examined between January 2015 and July 2024, covering three types of cardiac disease: MI, DCM, and HCM. Six LLMs (GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, PaLM, and LLaMA) were used to classify and diagnose the CMR reports. The results of the LLMs, with either minimal or informative prompts, were compared with those of radiologists. Accuracy (ACC) and balanced accuracy (BAC) were used to evaluate the classification performance of the different LLMs. Agreement between radiologists and LLMs in classifying heart disease categories was evaluated using Gwet's agreement coefficient (AC1). Diagnostic performance was analyzed with receiver operating characteristic (ROC) curves. Cohen's kappa was used to assess the reproducibility of each LLM's diagnostic results across two runs 30 days apart.
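The metrics named above can be illustrated with a minimal sketch. It is not the authors' code: the toy labels and the function names are assumptions made for illustration, and the formulas are the standard two-rater definitions of balanced accuracy (mean per-class recall), Cohen's kappa, and Gwet's AC1.

```python
from collections import Counter

CATEGORIES = ["MI", "DCM", "HCM"]  # the three disease classes in the study


def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted class matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, categories=CATEGORIES):
    """Mean of per-class recall, so small classes weigh as much as large ones."""
    recalls = []
    for c in categories:
        idx = [i for i, t in enumerate(y_true) if t == c]
        if idx:  # skip classes absent from the reference labels
            recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def cohen_kappa(a, b, categories=CATEGORIES):
    """Chance-corrected agreement; chance term uses each rater's own marginals."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[c] / n) * (cb[c] / n) for c in categories)
    return (po - pe) / (1 - pe)


def gwet_ac1(a, b, categories=CATEGORIES):
    """Gwet's AC1 for two raters; chance term uses the averaged marginals."""
    n, q = len(a), len(categories)
    po = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pi = [(ca[c] / n + cb[c] / n) / 2 for c in categories]
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (po - pe) / (1 - pe)


# Hypothetical toy data: radiologist labels versus one LLM's labels.
radiologist = ["MI", "MI", "MI", "MI", "DCM", "DCM", "HCM", "HCM"]
llm = ["MI", "MI", "MI", "DCM", "DCM", "DCM", "HCM", "MI"]

acc = accuracy(radiologist, llm)
bac = balanced_accuracy(radiologist, llm)
kappa = cohen_kappa(radiologist, llm)
ac1 = gwet_ac1(radiologist, llm)
print(f"ACC={acc:.3f} BAC={bac:.3f} kappa={kappa:.3f} AC1={ac1:.3f}")
```

In the study design described above, kappa would be computed between two runs of the same model 30 days apart, while AC1 compares each model with the radiologists. The same two-rater functions cover both cases.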

RESULTS

This study included 543 CMR cases: 275 MI, 120 DCM, and 148 HCM. Ranked by overall BAC with minimal prompts, the LLMs were, from highest to lowest, GPT-4.0, LLaMA, PaLM, GPT-3.5, Gemini-1.5, and Gemini-1.0. With informative prompts, GPT-3.5 (P<0.001), GPT-4.0 (P<0.001), Gemini-1.0 (P<0.001), Gemini-1.5 (P=0.02), and PaLM (P<0.001) showed significant improvements in overall ACC over their minimal-prompt counterparts, whereas LLaMA did not (P=0.06). GPT-4.0 performed best with both minimal prompts (ACC = 88.6%, BAC = 91.7%) and informative prompts (ACC = 95.8%, BAC = 97.1%).

GPT-4.0 demonstrated the highest agreement with radiologists [AC1 = 0.82, 95% confidence interval (CI): 0.78-0.86], significantly outperforming the other models (P<0.001). With informative prompts, GPT-4.0 (AC1 = 0.93, 95% CI: 0.90-0.96), GPT-3.5 (AC1 = 0.93, 95% CI: 0.90-0.95), Gemini-1.0 (AC1 = 0.90, 95% CI: 0.87-0.93), PaLM (AC1 = 0.86, 95% CI: 0.82-0.90), LLaMA (AC1 = 0.82, 95% CI: 0.78-0.86), and Gemini-1.5 (AC1 = 0.80, 95% CI: 0.76-0.84) all showed almost perfect agreement with radiologists' diagnoses.

With minimal prompts, diagnostic performance was excellent for GPT-4.0 [area under the curve (AUC) = 0.93, 95% CI: 0.92-0.95] and LLaMA (AUC = 0.92, 95% CI: 0.90-0.94). Informative prompts improved performance further, with GPT-4.0 reaching the highest AUC of 0.98 (95% CI: 0.97-0.99). All models demonstrated good reproducibility (κ>0.80, P<0.001).
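The abstract does not say which test produced the P values for the minimal versus informative prompt comparison. Because both prompts are applied to the same reports, one common choice for such paired data is McNemar's test. The sketch below shows its exact (binomial) form, under that assumption and with hypothetical toy data; it is not a reconstruction of the authors' analysis.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-report correctness flags.

    Returns (b, c, p), where b counts reports correct only under condition A,
    c counts reports correct only under condition B, and p is the two-sided
    binomial P value on the b + c discordant pairs.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical flags for ten reports (1 = LLM label matched the radiologist).
minimal_correct = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
informative_correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

b, c, p_value = mcnemar_exact(minimal_correct, informative_correct)
print(f"minimal-only={b} informative-only={c} P={p_value:.3f}")
```

Only the discordant reports drive the result: those that one prompt classified correctly and the other did not. Two prompts with similar overall ACC can therefore still differ significantly if their errors fall on different reports.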

CONCLUSIONS

LLMs demonstrated outstanding performance in the automated classification and diagnosis of targeted CMR interpretations, especially with informative prompts, suggesting the potential for these models to serve as adjunct tools in CMR diagnostic workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0460/12432601/a05beb14686f/cdt-15-04-726-f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0460/12432601/c3d9162a0b1a/cdt-15-04-726-f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0460/12432601/6247bfccc534/cdt-15-04-726-f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0460/12432601/52e09502b306/cdt-15-04-726-f3.jpg
