Du Xinsong, Zhou Zhengyang, Wang Yifei, Chuang Ya-Wen, Li Yiming, Yang Richard, Hong Pengyu, Bates David W, Zhou Li
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA 02115, United States; Department of Medicine, Harvard Medical School, Boston, MA 02115, United States.
Department of Computer Science, Brandeis University, Waltham, MA 02453, United States.
Int J Med Inform. 2025 Aug 28;205:106091. doi: 10.1016/j.ijmedinf.2025.106091.
To synthesize performance and improvement strategies for adapting generative large language models (LLMs) to electronic health record (EHR) analyses and applications.
We followed the PRISMA guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023 and November 9, 2024. Multiple reviewers, including biomedical informaticians and a clinician, were involved in the article review process. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations of an improvement technique. The review identified key clinical applications and summarized performance and improvement strategies.
Of the 18,735 articles retrieved, 196 met our criteria. 112 (57.1%) studies used generative LLMs for clinical decision support tasks, 40 (20.4%) involved documentation tasks, 39 (19.9%) involved information extraction tasks, 11 (5.6%) involved patient communication tasks, and 10 (5.1%) included summarization tasks (task categories overlap, so counts exceed 196). Among the 196 studies, most (87.8%) did not quantitatively evaluate LLM performance improvement strategies; the remaining 24 studies (12.2%) quantitatively evaluated the effectiveness of in-context learning (9 studies), fine-tuning (12 studies), multimodal integration (8 studies), and ensemble learning (2 studies), with some studies evaluating more than one strategy. Three studies highlighted that few-shot prompting, fine-tuning, and multimodal data integration might not improve performance, and another two found that a fine-tuned smaller model could outperform a larger model.
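To make the most frequently evaluated strategy concrete, below is a minimal, hypothetical sketch of in-context learning (few-shot prompting) applied to an EHR information extraction task. It assumes the OpenAI Python client (v1+); the model name, the synthetic note, and the extract_medication helper are illustrative placeholders, not drawn from any of the reviewed studies.

```python
# Minimal sketch: zero-shot vs. few-shot (in-context learning) prompting for
# extracting a medication mention from a synthetic EHR note.
# Assumes the `openai` Python client (v1+); model name and note text are
# hypothetical placeholders, not taken from the reviewed studies.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NOTE = "Pt started on metformin 500 mg BID for newly diagnosed T2DM."

# Few-shot examples prepended to the prompt; prepending worked examples like
# these is the "in-context learning" strategy evaluated in the reviewed studies.
FEW_SHOT = (
    "Note: Continue lisinopril 10 mg daily for HTN.\n"
    "Medication: lisinopril 10 mg daily\n\n"
    "Note: Started atorvastatin 40 mg qhs.\n"
    "Medication: atorvastatin 40 mg qhs\n\n"
)

def extract_medication(note: str, few_shot: bool = False) -> str:
    """Ask the model to extract the medication; optionally prepend examples."""
    prompt = (FEW_SHOT if few_shot else "") + f"Note: {note}\nMedication:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    print("zero-shot:", extract_medication(NOTE))
    print("few-shot: ", extract_medication(NOTE, few_shot=True))
```

Whether the few-shot variant actually outperforms the zero-shot one is precisely the kind of question the 24 studies above answered quantitatively, and in some cases in the negative.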
Applying a performance improvement strategy does not necessarily lead to performance gains, and detailed guidelines on how to apply these strategies effectively and safely are needed; such guidelines could be informed by more quantitative analyses in the future.
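As one illustration of the paired, quantitative comparison this conclusion calls for, the following sketch scores two strategies on the same labeled test set and applies McNemar's test to their per-case correctness. It assumes statsmodels is installed; the correctness vectors are fabricated placeholders for illustration only, not data from the reviewed studies.

```python
# Minimal sketch of a paired, quantitative comparison of two prompting
# strategies on the same labeled test set.
# Assumes statsmodels; the 0/1 vectors below are hypothetical placeholders
# marking whether each strategy was correct on each test case.
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-case correctness for zero-shot vs. few-shot on 10 notes.
zero_shot_correct = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
few_shot_correct  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

# 2x2 table of paired outcomes: both right, only zero-shot, only few-shot, neither.
pairs = list(zip(zero_shot_correct, few_shot_correct))
both      = sum(1 for a, b in pairs if a and b)
only_zero = sum(1 for a, b in pairs if a and not b)
only_few  = sum(1 for a, b in pairs if b and not a)
neither   = sum(1 for a, b in pairs if not a and not b)

result = mcnemar([[both, only_zero], [only_few, neither]], exact=True)
print(f"zero-shot acc: {sum(zero_shot_correct) / len(pairs):.2f}, "
      f"few-shot acc: {sum(few_shot_correct) / len(pairs):.2f}, "
      f"McNemar p = {result.pvalue:.3f}")
```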