Min Dabin, Jin Kwang Nam, Bang SangHeum, Kim Moon Young, Kim Hack-Lyoung, Jeong Won Gi, Lee Hye-Jeong, Beck Kyongmin Sarah, Hwang Sung Ho, Kim Eun Young, Park Chang Min
Interdisciplinary Program in Bioengineering, Seoul National University Graduate School, Seoul, Republic of Korea.
Integrated Major in Innovative Medical Science, Seoul National University Graduate School, Seoul, Republic of Korea.
Korean J Radiol. 2025 Sep;26(9):817-831. doi: 10.3348/kjr.2025.0293.
To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and assess the impact of prompting strategies.
In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction with prompting strategies, including zero-shot or few-shot with or without chain-of-thought (CoT) prompting. The accuracy was assessed and compared using McNemar's test.
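The accuracy comparison above uses McNemar's test, which conditions only on the discordant pairs (reports where exactly one of two models is correct). A minimal exact-test sketch; the counts are illustrative, not taken from the study's data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts b and c.
    Under H0 the discordant pairs follow Binomial(n, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: no evidence either way
    k = min(b, c)
    # Two-sided exact p-value: doubled lower binomial tail, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical counts: model A alone correct on 2 reports,
# model B alone correct on 8 reports.
print(mcnemar_exact(2, 8))  # 0.109375
```

For larger samples, the chi-square approximation (e.g., `statsmodels.stats.contingency_tables.mcnemar` with `exact=False`) gives similar results; the exact form is safer for the small discordant counts typical of high-accuracy model comparisons.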
LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49, Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167, GPT-4o and o1-mini) in external testing. Plaque burden extraction showed exceptional accuracy, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138, GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed relatively low accuracy for stenosis severity: 0.898 (44/49, internal) and 0.820 (137/167, external). CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001) in external testing.
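The zero-shot versus chain-of-thought contrast reported above can be sketched as prompt templates. This is a hypothetical illustration of the two strategies, not the study's actual optimized instruction, and the report text is invented:

```python
# Illustrative CCTA report snippet (not from the study's dataset)
REPORT = (
    "Coronary CTA: 70% stenosis in the proximal LAD with mixed plaque. "
    "Segment involvement score 5. No stent or graft."
)

# Zero-shot: ask for the final answer directly
ZERO_SHOT = (
    "Extract the CAD-RADS 2.0 stenosis severity category, plaque burden "
    "grade, and modifiers from the following CCTA report. "
    "Output the answer as JSON.\n\n"
    f"Report: {REPORT}"
)

# Chain-of-thought: require explicit intermediate reasoning steps
COT = (
    "Extract the CAD-RADS 2.0 components from the following CCTA report. "
    "First, reason step by step: (1) identify the maximal stenosis and map "
    "it to a CAD-RADS category; (2) map the plaque measure to a plaque "
    "burden grade (P1-P4); (3) check for modifiers (N, HRP, I, S, G, E). "
    "Then output the final answer as JSON.\n\n"
    f"Report: {REPORT}"
)
```

In the few-shot variants, one or more worked report/answer pairs would be prepended to either template before the target report.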
LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.