Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.

Authors

Dorfner Felix J, Dada Amin, Busch Felix, Makowski Marcus R, Han Tianyu, Truhn Daniel, Kleesiek Jens, Sushil Madhumita, Adams Lisa C, Bressem Keno K

Affiliations

Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany.

Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States.

Publication

J Am Med Inform Assoc. 2025 Jun 1;32(6):1015-1024. doi: 10.1093/jamia/ocaf045.

DOI: 10.1093/jamia/ocaf045
PMID: 40190132
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12089759/
Abstract

OBJECTIVES

Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks.

MATERIALS AND METHODS

We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization, and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities.

RESULTS

Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate.
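The pairwise accuracy comparisons reported above can be tabulated directly (a minimal sketch using only the figures quoted in this abstract; the benchmark/size grouping is an organizational assumption, and all model names and percentages are taken verbatim from the Results):

```python
# Case-challenge accuracies (percent correct) quoted in the Results section.
results = {
    ("JAMA", "70B"): {"biomedical": ("OpenBioLLM-70B", 66.4),
                      "general": ("Llama-3-70B-Instruct", 65.0)},
    ("NEJM", "8B"):  {"biomedical": ("OpenBioLLM-8B", 30.0),
                      "general": ("Llama-3-8B-Instruct", 64.3)},
}

for (benchmark, size), pair in results.items():
    bio_name, bio_acc = pair["biomedical"]
    gen_name, gen_acc = pair["general"]
    delta = round(bio_acc - gen_acc, 1)  # positive favors the biomedical model
    print(f"{benchmark} ({size}): {bio_name} {bio_acc}% vs "
          f"{gen_name} {gen_acc}% (delta {delta:+.1f} pts)")
```

The tabulation makes the size effect explicit: at 70B the biomedical model is within about one point of its general-purpose counterpart, while at 8B it trails by more than 34 points.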

DISCUSSION

Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation.

CONCLUSION

Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.


Similar Articles

1
A dataset and benchmark for hospital course summarization with adapted large language models.
J Am Med Inform Assoc. 2025 Mar 1;32(3):470-479. doi: 10.1093/jamia/ocae312.
2
Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese.
JMIR Med Inform. 2025 Jul 8;13:e76773. doi: 10.2196/76773.
3
Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise.
Quant Imaging Med Surg. 2025 Jun 6;15(6):5450-5462. doi: 10.21037/qims-2024-2655. Epub 2025 Jun 3.
4
Evaluating and Improving Syndrome Differentiation Thinking Ability in Large Language Models: Method Development Study.
JMIR Med Inform. 2025 Jun 20;13:e75103. doi: 10.2196/75103.
5
Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.
J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.
6
Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.
JMIR Med Inform. 2025 Jan 16;13:e65047. doi: 10.2196/65047.
7
Automated Extraction of Patient-Centered Outcomes After Breast Cancer Treatment: An Open-Source Large Language Model-Based Toolkit.
JCO Clin Cancer Inform. 2024 Aug;8:e2300258. doi: 10.1200/CCI.23.00258.
8
Using Generative Artificial Intelligence in Health Economics and Outcomes Research: A Primer on Techniques and Breakthroughs.
Pharmacoecon Open. 2025 Apr 29. doi: 10.1007/s41669-025-00580-4.
9
Utilizing large language models for detecting hospital-acquired conditions: an empirical study on pulmonary embolism.
J Am Med Inform Assoc. 2025 May 1;32(5):876-884. doi: 10.1093/jamia/ocaf048.

Cited By

1
Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis.
Clin Rheumatol. 2025 Sep 15. doi: 10.1007/s10067-025-07640-4.
2
SHREC: A Framework for Advancing Next-Generation Computational Phenotyping with Large Language Models.
ArXiv. 2025 Jul 17:arXiv:2506.16359v3.
3
Applicability Assessment of Technologies for Predictive and Prescriptive Analytics of Nephrology Big Data.
Proteomics. 2025 Jun;25(11-12):e202400135. doi: 10.1002/pmic.202400135. Epub 2025 May 27.
4
Harnessing the power of large language models for clinical tasks and synthesis of scientific literature.
J Am Med Inform Assoc. 2025 Jun 1;32(6):983-984. doi: 10.1093/jamia/ocaf071.

References

1
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.
JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969.
2
Generative artificial intelligence in primary care: an online survey of UK general practitioners.
BMJ Health Care Inform. 2024 Sep 17;31(1):e101102. doi: 10.1136/bmjhci-2024-101102.
3
PMC-LLaMA: toward building open-source language models for medicine.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1833-1843. doi: 10.1093/jamia/ocae045.
4
Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions.
JAMA. 2024 Apr 16;331(15):1320-1321. doi: 10.1001/jama.2023.27861.
5
Large language models in medicine.
Nat Med. 2023 Aug;29(8):1930-1940. doi: 10.1038/s41591-023-02448-8. Epub 2023 Jul 17.
6
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.
7
Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models.
Proc Int Conf Comput Ling. 2022 Oct;2022:2979-2991.
8
AI in health and medicine.
Nat Med. 2022 Jan;28(1):31-38. doi: 10.1038/s41591-021-01614-0. Epub 2022 Jan 20.
9
BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
10
Overcoming catastrophic forgetting in neural networks.
Proc Natl Acad Sci U S A. 2017 Mar 28;114(13):3521-3526. doi: 10.1073/pnas.1611835114. Epub 2017 Mar 14.