Automated cardiac magnetic resonance interpretation derived from prompted large language models.

Authors

Wang Lujing, Peng Liang, Wan Yixuan, Li Xingyu, Chen Yixin, Wang Li, Gong Xiuxian, Zhao Xiaoying, Yu Lequan, Zhao Shihua, Zhao Xinxiang

Affiliations

Department of Radiology, The Second Affiliated Hospital of Kunming Medical University, Kunming, China.

Department of Statistics and Actuarial Science, School of Computing and Data Science, The University of Hong Kong, Hong Kong, China.

Publication information

Cardiovasc Diagn Ther. 2025 Aug 30;15(4):726-737. doi: 10.21037/cdt-2025-112. Epub 2025 Aug 28.

DOI:10.21037/cdt-2025-112

PMID:40948711

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12432601/

Abstract

BACKGROUND

The versatility of cardiac magnetic resonance (CMR) makes its interpretation complex and time-consuming. Large language models (LLMs) have transformative potential for automated CMR interpretation. We explored the ability of LLMs to automatically classify and diagnose CMR reports for three common cardiac diseases: myocardial infarction (MI), dilated cardiomyopathy (DCM), and hypertrophic cardiomyopathy (HCM).

METHODS

This retrospective study included CMR reports of consecutive patients from January 2015 to July 2024, covering three cardiac diseases: MI, DCM, and HCM. Six LLMs (GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, PaLM, and LLaMA) were used to classify and diagnose the CMR reports. The results of each LLM, obtained with either a minimal or an informative prompt, were compared with those of radiologists. Accuracy (ACC) and balanced accuracy (BAC) were used to evaluate the classification performance of the different LLMs. Agreement between radiologists and LLMs in classifying heart disease categories was evaluated using Gwet's agreement coefficient (AC1). Diagnostic performance was analyzed with receiver operating characteristic (ROC) curves. Cohen's kappa was used to assess the reproducibility of the LLMs' diagnostic results obtained 30 days apart.
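The four summary statistics named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the ten toy labels, and the two-rater form of Gwet's AC1 are choices made here, and the abstract does not say which software or variant was used.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so that a large class cannot dominate."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)

def _agreement_parts(a, b, categories):
    """Observed agreement and per-category marginal proportions of two raters."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    marginals = [(count_a[c] / n, count_b[c] / n) for c in categories]
    return p_o, marginals

def cohen_kappa(a, b, categories):
    """Cohen's kappa: chance agreement is the product of the raters' marginals."""
    p_o, marginals = _agreement_parts(a, b, categories)
    p_e = sum(pa * pb for pa, pb in marginals)
    return (p_o - p_e) / (1 - p_e)

def gwet_ac1(a, b, categories):
    """Gwet's AC1 for two raters: chance agreement uses the averaged marginals."""
    p_o, marginals = _agreement_parts(a, b, categories)
    pi = [(pa + pb) / 2 for pa, pb in marginals]
    p_e = sum(p * (1 - p) for p in pi) / (len(categories) - 1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels (not study data): ten reports, three disease classes.
CATEGORIES = ["MI", "DCM", "HCM"]
radiologist = ["MI"] * 6 + ["DCM"] * 2 + ["HCM"] * 2
model = ["MI"] * 6 + ["DCM", "MI", "HCM", "DCM"]

if __name__ == "__main__":
    print("ACC  ", round(accuracy(radiologist, model), 3))
    print("BAC  ", round(balanced_accuracy(radiologist, model), 3))
    print("AC1  ", round(gwet_ac1(radiologist, model, CATEGORIES), 3))
    print("kappa", round(cohen_kappa(radiologist, model, CATEGORIES), 3))
```

In the toy data every MI report is classified correctly but half of the DCM and HCM reports are not, so BAC falls below ACC. This is the kind of class imbalance effect that reporting both metrics is meant to expose. The same `cohen_kappa` function could be applied to two runs of one model to mirror the reproducibility check, again as an assumption about how that comparison was set up.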

RESULTS

This study included 543 CMR cases: 275 MI, 120 DCM, and 148 HCM. Ranked by overall BAC with minimal prompts, the models were, from highest to lowest, GPT-4.0, LLaMA, PaLM, GPT-3.5, Gemini-1.5, and Gemini-1.0. With informative prompts, GPT-3.5 (P<0.001), GPT-4.0 (P<0.001), Gemini-1.0 (P<0.001), Gemini-1.5 (P=0.02), and PaLM (P<0.001) showed significant improvements in overall ACC over their minimal-prompt counterparts, whereas the improvement for LLaMA was not significant (P=0.06). GPT-4.0 performed best with both the minimal prompt (ACC = 88.6%, BAC = 91.7%) and the informative prompt (ACC = 95.8%, BAC = 97.1%). Among the minimally prompted models, GPT-4.0 demonstrated the highest agreement with radiologists [AC1 = 0.82, 95% confidence interval (CI): 0.78-0.86], significantly outperforming the others (P<0.001). With informative prompts, GPT-4.0 (AC1 = 0.93, 95% CI: 0.90-0.96), GPT-3.5 (AC1 = 0.93, 95% CI: 0.90-0.95), Gemini-1.0 (AC1 = 0.90, 95% CI: 0.87-0.93), PaLM (AC1 = 0.86, 95% CI: 0.82-0.90), LLaMA (AC1 = 0.82, 95% CI: 0.78-0.86), and Gemini-1.5 (AC1 = 0.80, 95% CI: 0.76-0.84) all showed almost perfect agreement with radiologists' diagnoses. Diagnostic performance with minimal prompts was excellent for GPT-4.0 [area under the curve (AUC) = 0.93, 95% CI: 0.92-0.95] and LLaMA (AUC = 0.92, 95% CI: 0.90-0.94), and informative prompts improved it further, with GPT-4.0 + informative prompt reaching the highest AUC of 0.98 (95% CI: 0.97-0.99). All models demonstrated good reproducibility (κ>0.80, P<0.001).

CONCLUSIONS

LLMs demonstrated outstanding performance in the automated classification and diagnosis of targeted CMR interpretations, especially with informative prompts, suggesting the potential for these models to serve as adjunct tools in CMR diagnostic workflows.
