An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study.

Affiliations

Department of Radiology and Biomedical Imaging, University of California, San Francisco, San Francisco, CA, USA.

Department of Radiology, University of Washington, Seattle, WA, USA.

Publication information

BMC Med Imaging. 2024 Sep 27;24(1):254. doi: 10.1186/s12880-024-01435-w.

DOI:10.1186/s12880-024-01435-w
PMID:39333958
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11428559/
Abstract

BACKGROUND

The impression section integrates key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source Large Language Model (LLM) in automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.

METHODS

In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, a metric that measures word overlap, was used for automatic natural language evaluation. A reader study with five cardiothoracic radiologists was performed to more strictly evaluate the model's performance on a specific modality (CT chest exams) with a radiologist subspecialist baseline. We stratified the results of the reader performance study based on the diagnosis category and the original impression length to gauge case complexity.
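The ROUGE-L variant reported in the results scores overlap through the longest common subsequence (LCS) of tokens shared by a reference and a candidate text. The following is a minimal sketch of that idea, not the study's evaluation code: the lowercase whitespace tokenization, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l` are assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """LCS-based precision, recall, and F1 between two strings.

    Assumption: tokens are lowercase whitespace-separated words and
    F1 weights precision and recall equally.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example texts, not taken from the study data.
scores = rouge_l("no acute cardiopulmonary abnormality", "no acute abnormality")
print(scores)
```

Scores computed this way fall between 0 and 1; the values in the abstract appear to be on a 0 to 100 scale.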

RESULTS

The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on UCSFMC and, upon external validation, ROUGE-L scores of 40.74, 37.89, and 24.61 on ZSFG across the CT, US, and MRI modalities respectively, implying a substantial degree of overlap between the model-generated impressions and impressions written by the subspecialist attending radiologists, but with a degree of degradation upon external validation. In our reader study, the model-generated impressions achieved overall mean scores of 3.56/4 for clinical accuracy, 3.92/4 for grammatical accuracy, and 3.37/4 for stylistic quality, with a mean edit time of 18.29 s, a mean edit distance of 12.32 words, and a ROUGE-L score of 84. The original impressions written by a subspecialist radiologist achieved 3.75/4, 3.87/4, and 3.54/4 on the same three ratings, with a mean edit time of 12.2 s, a mean edit distance of 5.74 words, and a ROUGE-L score of 89. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and on shorter impressions.
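The reader study reports edit distance in words. One common way to measure this is a word-level Levenshtein distance between the impression shown to the reader and the reader's edited version. The sketch below assumes that definition, which the abstract does not confirm; the function name `word_edit_distance` and the example strings are hypothetical.

```python
def word_edit_distance(original, edited):
    """Minimum number of word insertions, deletions, or substitutions
    needed to turn `original` into `edited` (word-level Levenshtein).

    Assumption: words are whitespace-separated and compared exactly.
    """
    a = original.split()
    b = edited.split()
    prev = list(range(len(b) + 1))  # distances from an empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Hypothetical example: one word substituted by the reader.
print(word_edit_distance("small right pleural effusion",
                         "small left pleural effusion"))
```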

CONCLUSIONS

An open-source fine-tuned LLM can generate impressions to a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists' workflows.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/531b/11428559/680d8e075ace/12880_2024_1435_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/531b/11428559/987c81db35c4/12880_2024_1435_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/531b/11428559/3d67742b7968/12880_2024_1435_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/531b/11428559/917848333681/12880_2024_1435_Fig5_HTML.jpg