Li HongYi, Fu Jun-Fen, Python Andre
Center for Data Science, Zhejiang University, Hangzhou, China.
School of Mathematical Sciences, Zhejiang University, Hangzhou, China.
J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916.
Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work.
We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and facilitate the integration process of LLMs in health care.
We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same investigated period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies.
We collected 330 LLMs and recorded how often each was applied to clinical tasks and how often it performed best in its context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are the key stages, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering methods or fine-tuning techniques based on specific datasets to improve model performance. Most LLMs with multimodal abilities are closed-source models and therefore lack transparency and offer limited scope for model customization and fine-tuning for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings.
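The reported proportions for GPT-3.5 and GPT-4 can be checked directly from the counts given above. The following is a minimal sketch, not the authors' analysis code; the names `TOTAL_SUBTASKS`, `counts`, `share`, and `summary` are illustrative assumptions, and only the counts themselves come from the abstract.

```python
# Counts taken from the abstract: number of clinical subtasks (out of 56)
# in which each model was applied and in which it performed best.
TOTAL_SUBTASKS = 56
counts = {
    "GPT-3.5": {"applied": 29, "best": 16},
    "GPT-4": {"applied": 40, "best": 30},
}


def share(n, total=TOTAL_SUBTASKS):
    """Return n/total as a percentage rounded to the nearest integer."""
    return round(100 * n / total)


# Hypothetical summary structure: model -> {"applied": %, "best": %}
summary = {
    model: {key: share(value) for key, value in c.items()}
    for model, c in counts.items()
}

for model, s in summary.items():
    print(f"{model}: applied to {s['applied']}% of subtasks, best in {s['best']}%")
```

Rounding to whole percentages reproduces the figures quoted in the text (52% and 71% applied; 29% and 54% best).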
In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs that can be successfully applied to a wide range of clinical tasks, so their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. Written from a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference for successfully applying LLMs in clinical settings.