
Enhancing AI for citation screening in literature reviews: Improving accuracy with ensemble models.

Authors

Zhang Zhihong, Momeni Nezhad Mohamad Javad, Gupta Pallavi, Zolnour Ali, Azadmaleki Hossein, Topaz Maxim, Zolnoori Maryam

Affiliations

Data Science Institute, Columbia University, New York, NY 10027, USA; School of Nursing, Columbia University, New York, NY 10032, USA.

Columbia University Irving Medical Center, New York, NY 10032, USA.

Publication information

Int J Med Inform. 2025 Nov;203:106035. doi: 10.1016/j.ijmedinf.2025.106035. Epub 2025 Jul 1.

DOI:10.1016/j.ijmedinf.2025.106035
PMID:40609462
Abstract

BACKGROUND

Healthcare literature reviews underpin evidence-based practice and clinical guideline development, with citation screening as a critical yet time-consuming step. This study evaluates the effectiveness of individual large language models (LLMs) versus ensemble approaches in automating citation screening to improve the efficiency and scalability of evidence synthesis in healthcare research.

METHODS

Performance was assessed across three healthcare-focused reviews: LLM-Healthcare (865 citations, broad scope, 49.8% inclusion rate), MCI-Speech (959 citations, narrow scope, 6.5% inclusion rate), and Multimodal-LLM (73 citations, moderate scope, 68.5% inclusion rate). Six LLMs (GPT-4o Mini, GPT-4o, Gemini Flash, Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, Llama 3.1 405B Instruct) were evaluated using zero- and few-shot learning strategies with PubMedBERT for demonstration selection. We compared individual model performance with ensemble methods, including majority voting and random forest (RF), based on sensitivity and specificity.
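To make the comparison concrete, the following is a minimal sketch of majority voting over several screeners' include/exclude decisions, scored by sensitivity and specificity. It is not the authors' implementation: the function names, the toy votes, the three hypothetical screeners, and the tie-breaking rule (a tie counts as exclude) are all assumptions made for illustration.

```python
def majority_vote(votes):
    """Return 1 (include) if strictly more than half of the votes are 1, else 0.

    Assumption: ties (possible with an even number of screeners) count as exclude.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0


def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels, where 1 = include."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data (hypothetical): one row per citation, one column per LLM screener.
model_votes = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
]
# Hypothetical human reference decisions for the same six citations.
human_labels = [1, 1, 0, 0, 1, 0]

ensemble = [majority_vote(row) for row in model_votes]
sens, spec = sensitivity_specificity(human_labels, ensemble)
print(f"ensemble decisions: {ensemble}")
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In a screening setting, where a missed relevant citation is usually costlier than an extra one to review, a tie could reasonably be resolved toward inclusion instead; the abstract does not say which rule was used.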

RESULTS

No individual LLM consistently outperformed the others across all tasks. Reviews with narrow inclusion criteria and low inclusion rates exhibited high specificity but lower sensitivity. Ensemble methods consistently surpassed individual LLMs: the RF ensemble with GPT-4o performed best in LLM-Healthcare (sensitivity: 0.96, specificity: 0.89); majority voting with 1-shot LLMs (sensitivity: 0.75, specificity: 0.86) and the RF ensemble with 4-shot LLMs (sensitivity: 0.62, specificity: 0.97) excelled in MCI-Speech; and four RF ensembles achieved perfect classification (sensitivity: 1.0, specificity: 1.0) in Multimodal-LLM.
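The trade-off between the two MCI-Speech ensembles is easier to see as approximate counts. The sketch below is a back-of-envelope calculation only: it assumes, purely for illustration, that the reported sensitivity and specificity apply to all 959 citations at the stated 6.5% inclusion rate, which the abstract does not confirm (the metrics may have been computed on a held-out subset). The function name `expected_confusion` is hypothetical.

```python
def expected_confusion(n_total, inclusion_rate, sensitivity, specificity):
    """Estimate confusion-matrix counts from summary rates (rounded, illustrative)."""
    n_included = round(n_total * inclusion_rate)   # citations a human would include
    n_excluded = n_total - n_included              # citations a human would exclude
    tp = round(n_included * sensitivity)           # relevant citations kept
    fn = n_included - tp                           # relevant citations missed
    tn = round(n_excluded * specificity)           # irrelevant citations removed
    fp = n_excluded - tn                           # irrelevant citations passed on
    return {"included": n_included, "excluded": n_excluded,
            "tp": tp, "fn": fn, "tn": tn, "fp": fp}


# MCI-Speech figures from the abstract: 959 citations, 6.5% inclusion rate.
majority_1shot = expected_confusion(959, 0.065, sensitivity=0.75, specificity=0.86)
rf_4shot = expected_confusion(959, 0.065, sensitivity=0.62, specificity=0.97)

for name, counts in [("majority voting, 1-shot", majority_1shot),
                     ("RF ensemble, 4-shot", rf_4shot)]:
    print(name, counts)
```

Under these assumptions, the majority-voting ensemble would pass on roughly 126 irrelevant citations for manual review, against about 27 for the RF ensemble, while the RF ensemble would miss more of the roughly 62 relevant ones. With an inclusion rate this low, a modest change in specificity shifts the reviewer workload far more than the same change in sensitivity.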

CONCLUSION

Ensemble approaches improve on the performance of individual LLMs in citation screening across diverse healthcare review tasks, highlighting their potential to enhance evidence synthesis workflows that support clinical decision-making. However, broader validation is needed before real-world implementation.

Similar articles

1. Enhancing AI for citation screening in literature reviews: Improving accuracy with ensemble models.
   Int J Med Inform. 2025 Nov;203:106035. doi: 10.1016/j.ijmedinf.2025.106035. Epub 2025 Jul 1.
2. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.
   Eur Radiol Exp. 2025 Jun 19;9(1):61. doi: 10.1186/s41747-025-00600-2.
3. High-performance automated abstract screening with large language model ensembles.
   J Am Med Inform Assoc. 2025 May 1;32(5):893-904. doi: 10.1093/jamia/ocaf050.
4. Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study.
   J Med Internet Res. 2025 Jul 14;27:e70080. doi: 10.2196/70080.
5. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.
   J Allergy Clin Immunol. 2025 Feb 14. doi: 10.1016/j.jaci.2025.02.004.
6. Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study.
   J Med Internet Res. 2025 Aug 6;27:e74231. doi: 10.2196/74231.
7. Large Language Models and Empathy: Systematic Review.
   J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
8. Performance of LLMs in Citation Screening: A Comparison Across Datasets with Varied Inclusion Rates.
   Stud Health Technol Inform. 2025 Aug 7;329:1886-1887. doi: 10.3233/SHTI251264.
9. Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders.
   medRxiv. 2024 Oct 17:2024.10.15.24315526. doi: 10.1101/2024.10.15.24315526.
10. Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report.
    J Med Internet Res. 2025 Jun 11;27:e72638. doi: 10.2196/72638.