
Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.

Authors

Wihl Jonas, Rosenkranz Enrike, Schramm Severin, Berberich Cornelius, Griessmair Michael, Woźnicki Piotr, Pinto Francisco, Ziegelmayer Sebastian, Adams Lisa C, Bressem Keno K, Kirschke Jan S, Zimmer Claus, Wiestler Benedikt, Hedderich Dennis, Kim Su Hwan

Affiliations

Department of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.

Department of Diagnostic, Interventional and Pediatric Radiology, Inselspital Bern, University of Bern, Bern, Switzerland.

Publication information

Eur Radiol Exp. 2025 Jun 19;9(1):61. doi: 10.1186/s41747-025-00600-2.

DOI: 10.1186/s41747-025-00600-2
PMID: 40536631
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12179022/

Abstract

BACKGROUND

To evaluate the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.

METHODS

The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.
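
The two prompting conditions described above can be illustrated with a minimal sketch. It is not the authors' prompt: the instruction wording, the `build_prompt` helper, and the placeholder finding names are assumptions made for illustration, since the abstract does not list the ten findings or reproduce the prompt text.

```python
from typing import List, Optional

# Placeholder names only: the abstract states that ten findings were
# extracted but does not enumerate them.
FINDINGS = ["finding_1", "finding_2", "finding_3"]


def build_prompt(report: str, findings: List[str],
                 guideline: Optional[str] = None) -> str:
    """Assemble an extraction prompt, optionally including a guideline.

    Passing guideline=None reproduces the "without guideline" condition;
    passing a string reproduces the "with guideline" condition.
    """
    parts = [
        "Extract the following imaging findings from the CT report.",
        "Answer with a JSON object mapping each finding to true or false.",
        "Findings: " + ", ".join(findings),
    ]
    if guideline:
        # The guideline is inserted verbatim before the report text.
        parts.append("Annotation guideline:\n" + guideline)
    parts.append("Report:\n" + report)
    return "\n\n".join(parts)


prompt_without = build_prompt("No acute findings.", FINDINGS)
prompt_with = build_prompt("No acute findings.", FINDINGS,
                           guideline="Count only acute findings.")
```

Keeping everything except the guideline block identical between the two calls means any difference in model output can be attributed to the guideline alone.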

RESULTS

GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision, while recall largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.87 to 0.94, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions.
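
For readers unfamiliar with micro-averaging, the following sketch shows how micro-averaged precision and recall are commonly computed for multi-label extraction: true positives, false positives, and false negatives are pooled across all reports and all findings before the ratios are taken. The function name and the toy labels are assumptions for illustration and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def micro_precision_recall(y_true: Sequence[Sequence[int]],
                           y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pool counts over every (report, finding) pair, then compute ratios."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(y_true, y_pred):
        for truth, pred in zip(truth_row, pred_row):
            if pred and truth:
                tp += 1          # finding present and extracted
            elif pred and not truth:
                fp += 1          # finding extracted but not present
            elif truth and not pred:
                fn += 1          # finding present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: two reports, three findings each (1 = present).
reference: List[List[int]] = [[1, 0, 1], [0, 0, 1]]
extracted: List[List[int]] = [[1, 1, 1], [0, 0, 0]]
precision, recall = micro_precision_recall(reference, extracted)
```

In the toy data there are two true positives, one false positive, and one false negative, so both precision and recall equal 2/3. Fewer false positives raise precision without changing recall, which is the pattern the abstract reports when the guideline is added.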

CONCLUSION

GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o consistently outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.

RELEVANCE STATEMENT

Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.

KEY POINTS

• LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored.
• Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt.
• Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.
