Dorfner Felix J, Dada Amin, Busch Felix, Makowski Marcus R, Han Tianyu, Truhn Daniel, Kleesiek Jens, Sushil Madhumita, Adams Lisa C, Bressem Keno K
Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany.
Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States.
J Am Med Inform Assoc. 2025 Jun 1;32(6):1015-1024. doi: 10.1093/jamia/ocaf045.
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks.
We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization, and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities.
Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. On the case challenges, larger biomedical and general-purpose models performed similarly (e.g., OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), while smaller biomedical models underperformed markedly (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across the CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate.
Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation.
Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.