Cao Christian, Sang Jason, Arora Rohit, Chen David, Kloosterman Robert, Cecere Matthew, Gorla Jaswanth, Saleh Richard, Drennan Ian, Teja Bijan, Fehlings Michael, Ronksley Paul, Leung Alexander A, Weisz Dany E, Ware Harriet, Whelan Mairead, Emerson David B, Arora Rahul K, Bobrovitz Niklas
Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, and Centre for Health Informatics, Department of Community Health Sciences, University of Calgary, Calgary, Alberta, Canada (C.C.).
Stripe, San Francisco, California (J.S.).
Ann Intern Med. 2025 Mar;178(3):389-401. doi: 10.7326/ANNALS-24-02189. Epub 2025 Feb 25.
BACKGROUND: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
OBJECTIVE: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
DESIGN: Diagnostic test accuracy.
SETTING: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).
PARTICIPANTS: None.
MEASUREMENTS: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
RESULTS: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
LIMITATIONS: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.
CONCLUSION: A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.
PRIMARY FUNDING SOURCE: None.
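The screening task described in the abstract can be illustrated with a minimal sketch: the model receives a review's eligibility criteria together with a citation's title and abstract and is asked for an include/exclude decision. The prompt wording, criteria text, and function names below are illustrative assumptions, not the optimized template developed in the study.

```python
# Minimal sketch of criteria-based abstract screening with an LLM.
# Prompt wording and eligibility criteria are hypothetical placeholders,
# not the published template; requires the openai v1.x package and an API key.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ELIGIBILITY_CRITERIA = """\
Population: adults with condition X.
Intervention: treatment Y.
Study design: randomized controlled trials.
"""  # each review supplies its own criteria

def screen_abstract(title: str, abstract: str) -> str:
    """Ask the model to return INCLUDE or EXCLUDE for one citation."""
    prompt = (
        "You are screening citations for a systematic review.\n"
        f"Eligibility criteria:\n{ELIGIBILITY_CRITERIA}\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decisions for reproducible screening
    )
    return response.choices[0].message.content.strip().upper()
```

In practice, a decision like this would be collected for every citation in the search and compared against the review authors' final inclusion decisions, as sketched next.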
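The diagnostic test accuracy evaluation compares the model's include/exclude calls against the original SR authors' post-full-text decisions, which serve as the reference standard. A hedged sketch of that comparison, with illustrative variable names, is shown below.

```python
# Sketch of the diagnostic-accuracy comparison: LLM decisions vs. the original
# SR authors' decisions after full-text screening (the reference standard).
# Decisions are booleans: True = include, False = exclude.
def diagnostic_accuracy(llm_decisions, author_decisions):
    pairs = list(zip(llm_decisions, author_decisions))
    tp = sum(l and a for l, a in pairs)            # correctly included
    tn = sum(not l and not a for l, a in pairs)    # correctly excluded
    fp = sum(l and not a for l, a in pairs)        # wrongly included
    fn = sum(not l and a for l, a in pairs)        # wrongly excluded
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

For screening, sensitivity (not missing articles the authors ultimately included) is the critical metric, which is why the abstract reports it alongside specificity for each prompt and model.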