Large Language Models for CAD-RADS 2.0 Extraction From Semi-Structured Coronary CT Angiography Reports: A Multi-Institutional Study.

Author information

Min Dabin, Jin Kwang Nam, Bang SangHeum, Kim Moon Young, Kim Hack-Lyoung, Jeong Won Gi, Lee Hye-Jeong, Beck Kyongmin Sarah, Hwang Sung Ho, Kim Eun Young, Park Chang Min

Affiliations

Interdisciplinary Program in Bioengineering, Seoul National University Graduate School, Seoul, Republic of Korea.

Integrated Major in Innovative Medical Science, Seoul National University Graduate School, Seoul, Republic of Korea.

Publication information

Korean J Radiol. 2025 Sep;26(9):817-831. doi: 10.3348/kjr.2025.0293.

Abstract

OBJECTIVE

To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and assess the impact of prompting strategies.

MATERIALS AND METHODS

In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction with prompting strategies, including zero-shot or few-shot with or without chain-of-thought (CoT) prompting. The accuracy was assessed and compared using McNemar's test.
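The paired comparison named above can be sketched as follows. This is a minimal illustration of the exact (binomial) form of McNemar's test, assuming each report is scored as correct or incorrect under two conditions (for example, the same model with and without CoT). The abstract does not say whether the exact or the chi-square form was used, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count reports where exactly one of the two paired conditions was correct."""
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for a, other in pairs if a and not other)  # A correct, B wrong
    c = sum(1 for a, other in pairs if not a and other)  # A wrong, B correct
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either condition, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Illustrative correctness flags only; these are NOT data from the study.
    without_cot = [True, False, False, True, False, False, True, False]
    with_cot = [True, True, True, True, True, True, True, False]
    b, c = discordant_counts(without_cot, with_cot)
    print(b, c, mcnemar_exact(b, c))
```

Only the discordant pairs enter the test, which is why it suits comparisons in which the same reports are scored under both conditions.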

RESULTS

LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49, Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167, GPT-4o and o1-mini) in external testing. Plaque burden extraction showed exceptional accuracy, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138, GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed a relatively low accuracy for stenosis severity: 0.898 (44/49, internal) and 0.820 (137/167, external). CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001) in external testing.
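The reported accuracies follow directly from the correct/total counts given in parentheses. The short check below recomputes each one to three decimals; the dictionary keys are descriptive labels chosen here, and the counts are copied from the results paragraph. Note that the denominators (49, 167, 43, 138) are smaller than the 50 internal and 169 external test reports; the abstract does not state why.

```python
# (correct, total, accuracy as stated in the abstract)
REPORTED = {
    "stenosis severity, internal, best models": (48, 49, "0.980"),
    "stenosis severity, external, best models": (158, 167, "0.946"),
    "plaque burden, internal, best models": (43, 43, "1.000"),
    "plaque burden, external, best models": (137, 138, "0.993"),
    "stenosis severity, internal, DeepSeek-R1-Distill-Qwen-14B": (44, 49, "0.898"),
    "stenosis severity, external, DeepSeek-R1-Distill-Qwen-14B": (137, 167, "0.820"),
}


def accuracy(correct, total):
    """Fraction of reports whose extracted value matched the reference standard."""
    return correct / total


def recomputed():
    """Return each accuracy formatted to three decimals, keyed by label."""
    return {label: f"{accuracy(c, t):.3f}" for label, (c, t, _) in REPORTED.items()}


if __name__ == "__main__":
    for label, value in recomputed().items():
        stated = REPORTED[label][2]
        print(f"{label}: {value} (stated: {stated})")
```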

CONCLUSION

LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.
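The four prompting conditions described in the methods (zero-shot or few-shot, each with or without CoT) can be enumerated as in the sketch below. This is only an illustration of the 2 × 2 design: the study's optimized instruction and CoT wording are not given in the abstract, so `COT_SUFFIX`, `build_prompt`, `all_strategies`, and every string in the example are hypothetical placeholders.

```python
from itertools import product

# Assumed wording; the study's actual CoT instruction is not in the abstract.
COT_SUFFIX = "Reason step by step before giving the final answer."


def build_prompt(instruction, report, examples=(), use_cot=False):
    """Assemble one prompt: instruction, optional worked examples, then the report."""
    parts = [instruction]
    for example_report, example_answer in examples:
        parts.append(f"Report:\n{example_report}\nAnswer:\n{example_answer}")
    if use_cot:
        parts.append(COT_SUFFIX)
    parts.append(f"Report:\n{report}\nAnswer:")
    return "\n\n".join(parts)


def all_strategies(instruction, report, examples):
    """Return the four prompt variants keyed by strategy name."""
    prompts = {}
    for few_shot, use_cot in product((False, True), repeat=2):
        name = ("few-shot" if few_shot else "zero-shot") + ("+CoT" if use_cot else "")
        shots = examples if few_shot else ()
        prompts[name] = build_prompt(instruction, report, shots, use_cot)
    return prompts


if __name__ == "__main__":
    # Placeholder strings, not real report text or study prompts.
    demo = all_strategies(
        "Extract the CAD-RADS 2.0 stenosis category, plaque burden, and modifiers.",
        "<report text>",
        [("<example report>", "<example answer>")],
    )
    for name in demo:
        print(name)
```

Keeping the instruction fixed and varying only the examples and the CoT suffix is what makes paired per-report comparisons between conditions meaningful.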

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0393/12394816/914b5fe2e6a9/kjr-26-817-g001.jpg