Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports.

Authors

Prucker Philipp, Busch Felix, Dorfner Felix, Mertens Christian J, Bayerl Nadine, Makowski Marcus R, Bressem Keno K, Adams Lisa C

Affiliations

Department of Diagnostic and Interventional Radiology, Technical University Munich, Ismaninger Str. 22, 81675 Munich, Germany.

Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Radiology, Charitéplatz 1, 10117 Berlin, Germany.

Publication information

Clin Imaging. 2025 Sep;125:110557. doi: 10.1016/j.clinimag.2025.110557. Epub 2025 Jul 5.

DOI: 10.1016/j.clinimag.2025.110557
PMID: 40639135

Abstract

RATIONALE AND OBJECTIVES

Large language models (LLMs) show promise for generating patient-friendly radiology reports, but the performance of open-source versus proprietary LLMs needs assessment. This study aimed to compare open-source and proprietary LLMs in generating patient-friendly radiology reports from chest CTs using quantitative readability metrics and qualitative assessments by radiologists.

MATERIALS AND METHODS

Fifty chest CT reports were processed by seven LLMs: three open-source models (Llama-3-70b, Mistral-7b, Mixtral-8x7b) and four proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, Gemini-Ultra). Simplification was evaluated using five quantitative readability metrics. Three radiologists rated patient-friendliness on a five-point Likert scale across five criteria. Content and coherence errors were counted. Inter-rater reliability and differences among models were statistically assessed.
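As an illustration of what a quantitative readability metric measures, the sketch below computes the Coleman-Liau Index, on the assumption that this is the "CLI" reported in the results. The abstract does not list the five metrics or describe the tooling, so the function name, the regex-based tokenization, and both example sentences are hypothetical and not taken from the study.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Approximate US grade level needed to read `text` (lower = easier).

    Standard formula: CLI = 0.0588 * L - 0.296 * S - 15.8, where
    L = letters per 100 words and S = sentences per 100 words.
    """
    # Naive tokenization (an assumption): words are runs of letters,
    # optionally joined by a hyphen or apostrophe.
    words = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    # Naive sentence split on terminal punctuation; abbreviations such as
    # "e.g." would be miscounted by this simple rule.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    letters = sum(ch.isalpha() for word in words for ch in word)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hypothetical example sentences, not drawn from the study's reports.
original = "Bilateral pulmonary consolidations with associated pleural effusions."
simplified = "Both lungs show dense areas. There is fluid around the lungs."

print(f"original:   {coleman_liau_index(original):.1f}")
print(f"simplified: {coleman_liau_index(simplified):.1f}")
```

Shorter words and shorter sentences both lower the score, which is why a simplified report is expected to score lower than the original. Scores on very short texts are unstable, so a real evaluation would apply the metric to whole reports.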

RESULTS

Inter-rater reliability was substantial to near perfect (κ = 0.76-0.86). Qualitatively, Llama-3-70b was non-inferior to leading proprietary models in 4/5 categories. GPT-3.5-Turbo showed the best overall readability, outperforming GPT-4 in two metrics. Llama-3-70b outperformed GPT-3.5-Turbo on the Coleman-Liau Index (CLI; p = 0.006). Claude-3-Opus and Gemini-Ultra scored lower on readability but were rated highly in qualitative assessments. Claude-3-Opus maintained perfect factual accuracy. Claude-3-Opus and GPT-4 outperformed Llama-3-70b in emotional sensitivity (90.0% vs 46.0%, p < 0.001).
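The abstract reports κ for three raters but does not say which kappa statistic was used. The sketch below shows Fleiss' kappa, one common choice for more than two raters, purely as an assumption. The function name and the toy rating table are hypothetical and unrelated to the study's data.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    n_categories = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy table (hypothetical): 4 reports, 3 raters, 2 categories.
ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

Fleiss' kappa treats categories as unordered. For ordinal Likert ratings a weighted statistic may be more appropriate, and the abstract does not indicate which approach the authors took.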

CONCLUSIONS

Llama-3-70b shows strong potential in generating quality, patient-friendly radiology reports, challenging proprietary models. With further adaptation, open-source LLMs could advance patient-friendly reporting technology.
