Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports.

Author Information

Nowak Sebastian, Wulff Benjamin, Layer Yannik C, Theis Maike, Isaak Alexander, Salam Babak, Block Wolfgang, Kuetting Daniel, Pieper Claus C, Luetkens Julian A, Attenberger Ulrike, Sprinkart Alois M

Affiliation

From the Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany.

Publication Information

Radiology. 2025 Jan;314(1):e240895. doi: 10.1148/radiol.240895.

Abstract

Background Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports.

Purpose To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI.

Materials and Methods In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was performed. These LLMs, with model weights released under open licenses, were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male; mean age, 62.6 years ± 21.5 {SD}]) was used to compare local fine-tuning of all open-weights LLMs (via low-rank adaptation and 4-bit quantization) with bidirectional encoder representations from transformers (BERT), using different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences.

Results For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert (Stanford University) (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI], 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama], 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI], 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 scores than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning.

Conclusion LLMs have the potential to outperform rule-based systems for zero-shot "out-of-the-box" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLMs outperformed BERT when moderate numbers of reports were used for fine-tuning.

Published under a CC BY 4.0 license. See also the editorial by Gee and Yao in this issue.
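
For readers curious what the locally run fine-tuning described in Materials and Methods typically involves, the sketch below shows a generic 4-bit-quantization-plus-LoRA setup using the Hugging Face transformers, bitsandbytes, and peft libraries. It is only a minimal illustration under assumed settings (the model identifier, adapter rank, and target modules are placeholders), not the authors' actual pipeline or hyperparameters.

```python
# Minimal sketch of local fine-tuning with 4-bit quantization and LoRA adapters.
# Model id, adapter rank, and target modules are illustrative assumptions,
# not the configuration used in the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # placeholder; the study used larger open-weights models

# Load the base model with 4-bit (NF4) quantized weights so it fits on local hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable low-rank adapters; only these are updated during fine-tuning,
# and the report text never has to leave the institution.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then proceed with a standard supervised fine-tuning loop on
# (report text -> structured findings) pairs, e.g. via transformers.Trainer.
```

Zero-shot prompting, the other approach evaluated in the study, would simply query such a locally hosted model with an instruction plus the report text, without any adapter training.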
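
The abstract's comparison criterion is the macro-averaged F1 score, with nonoverlapping 95% CIs treated as relevant differences. The paper's exact CI procedure is not given here; the snippet below is a minimal sketch of one common approach, a percentile bootstrap over reports, using scikit-learn's f1_score and made-up labels purely for illustration.

```python
# Minimal sketch of macro-averaged F1 with a percentile bootstrap 95% CI over reports.
# Labels and the number of resamples are made-up assumptions for illustration only.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def macro_f1_with_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    """Return the macro-F1 point estimate and a percentile bootstrap (1 - alpha) CI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(y_true)  # keep the label set fixed across resamples
    point = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample reports with replacement
        scores.append(
            f1_score(y_true[idx], y_pred[idx], labels=labels, average="macro", zero_division=0)
        )
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

# Toy example: one finding with three possible classes (e.g. absent / present / uncertain).
y_true = rng.integers(0, 3, size=500)
y_pred = rng.integers(0, 3, size=500)
print(macro_f1_with_ci(y_true, y_pred))
```

In the study, "relevant" differences between models were defined by nonoverlapping intervals of this kind rather than by formal hypothesis tests.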
