Zhang Zhihong, Momeni Nezhad Mohamad Javad, Gupta Pallavi, Zolnour Ali, Azadmaleki Hossein, Topaz Maxim, Zolnoori Maryam
Data Science Institute, Columbia University, New York, NY 10027, USA; School of Nursing, Columbia University, New York, NY 10032, USA.
Columbia University Irving Medical Center, New York, NY 10032, USA.
Int J Med Inform. 2025 Nov;203:106035. doi: 10.1016/j.ijmedinf.2025.106035. Epub 2025 Jul 1.
Healthcare literature reviews underpin evidence-based practice and clinical guideline development, with citation screening as a critical yet time-consuming step. This study evaluates the effectiveness of individual large language models (LLMs) versus ensemble approaches in automating citation screening to improve the efficiency and scalability of evidence synthesis in healthcare research.
Performance was assessed across three healthcare-focused reviews: LLM-Healthcare (865 citations, broad scope, 49.8% inclusion rate), MCI-Speech (959 citations, narrow scope, 6.5% inclusion rate), and Multimodal-LLM (73 citations, moderate scope, 68.5% inclusion rate). Six LLMs (GPT-4o Mini, GPT-4o, Gemini Flash, Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, Llama 3.1 405B Instruct) were evaluated under zero-shot and few-shot learning strategies, with PubMedBERT used to select few-shot demonstrations. We compared individual model performance with ensemble methods, including majority voting and random forest (RF), based on sensitivity and specificity.
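As a minimal sketch of the demonstration-selection step, the snippet below picks few-shot examples for an LLM prompt by ranking labeled citations against the query citation with PubMedBERT embeddings. The checkpoint name, mean pooling, and function names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: rank labeled citations against the query citation
# using PubMedBERT embeddings, and take the top-k as few-shot demonstrations.
# The HuggingFace checkpoint name and mean pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def select_demonstrations(query: str, labeled_pool: list[str], k: int = 4) -> list[str]:
    """Return the k labeled citations most similar to the query citation."""
    sims = torch.nn.functional.cosine_similarity(embed([query]), embed(labeled_pool))
    return [labeled_pool[i] for i in sims.topk(k).indices.tolist()]
```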
No individual LLM consistently outperformed the others across all tasks. The review with narrow inclusion criteria and a low inclusion rate (MCI-Speech) exhibited high specificity but lower sensitivity. Ensemble methods consistently surpassed individual LLMs: the RF ensemble with GPT-4o performed best in LLM-Healthcare (sensitivity: 0.96, specificity: 0.89); majority voting over 1-shot LLMs (sensitivity: 0.75, specificity: 0.86) and the RF ensemble of 4-shot LLMs (sensitivity: 0.62, specificity: 0.97) performed best in MCI-Speech; and four RF ensembles achieved perfect classification (sensitivity: 1.0, specificity: 1.0) in Multimodal-LLM.
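To make the ensemble comparison concrete, here is a minimal sketch, assuming each LLM's include/exclude decision per citation has been encoded as 1/0. scikit-learn's RandomForestClassifier stands in for the RF ensemble; all names and the toy data are illustrative, not the authors' code.

```python
# Minimal sketch of the two ensembles compared above, assuming binary
# include (1) / exclude (0) decisions from each of the six LLMs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

def majority_vote(votes: np.ndarray) -> np.ndarray:
    """Include a citation when more than half of the models vote include.

    votes: (n_citations, n_models) array of 0/1 decisions.
    """
    return (votes.mean(axis=1) > 0.5).astype(int)

def rf_ensemble(train_votes, train_labels, test_votes, seed=0):
    """Stack per-model decisions as features for a random forest meta-classifier."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(train_votes, train_labels)
    return rf.predict(test_votes)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: six models voting on ten citations with known labels.
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(10, 6))
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
print(sensitivity_specificity(labels, majority_vote(votes)))
```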
Ensemble approaches improve on the performance of individual LLMs in citation screening across diverse healthcare review tasks, highlighting their potential to enhance evidence synthesis workflows that support clinical decision-making. However, broader validation is needed before real-world implementation.