OBJECTIVE

To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and assess the impact of prompting strategies.

MATERIALS AND METHODS

In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction combined with zero-shot or few-shot prompting, each with or without chain-of-thought (CoT) prompting. Accuracies were compared using McNemar's test.
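McNemar's test compares two paired classifiers (here, two models or two prompting strategies scored on the same reports) using only the discordant pairs. The following is a minimal sketch, not the study's analysis code: the abstract does not say whether the exact or the chi-square form was used, so the exact binomial form is an assumption, and the function names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count discordant pairs between two per-report correctness lists.

    correct_a[i] / correct_b[i] are True when strategy A / B extracted
    the component correctly for report i (hypothetical input format).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both strategies must be scored on the same reports")
    # b: A right, B wrong; c: A wrong, B right
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely to
    # favor either strategy, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if one strategy were correct on 10 reports that the other missed and the reverse never happened, `mcnemar_exact(0, 10)` would give a p-value of about 0.002. These counts are illustrative only and are not taken from the study.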

RESULTS

LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49, Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167, GPT-4o and o1-mini) in external testing. Plaque burden extraction showed exceptional accuracy, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138, GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed a relatively low accuracy for stenosis severity: 0.898 (44/49, internal) and 0.820 (137/167, external). CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001) in external testing.
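Each reported accuracy is the number of correct extractions divided by the number of evaluated reports, rounded to three decimals. The short check below recomputes the proportions from the counts quoted above. The dictionary labels and the `accuracy` helper are illustrative names, not part of the study.

```python
def accuracy(correct, total):
    """Proportion of correct extractions, rounded to three decimals."""
    return round(correct / total, 3)


# (correct, total, accuracy as reported in the abstract)
reported = {
    "stenosis, internal, best models": (48, 49, 0.980),
    "stenosis, external, best models": (158, 167, 0.946),
    "plaque burden, internal, best models": (43, 43, 1.000),
    "plaque burden, external, best models": (137, 138, 0.993),
    "stenosis, internal, DeepSeek-R1-Distill-Qwen-14B": (44, 49, 0.898),
    "stenosis, external, DeepSeek-R1-Distill-Qwen-14B": (137, 167, 0.820),
}

for label, (correct, total, stated) in reported.items():
    computed = accuracy(correct, total)
    status = "matches" if computed == stated else "differs"
    print(f"{label}: {correct}/{total} = {computed:.3f} ({status})")
```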

CONCLUSION

LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.

Large Language Models for CAD-RADS 2.0 Extraction From Semi-Structured Coronary CT Angiography Reports: A Multi-Institutional Study.
