
Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews.

Authors

Cao Christian, Sang Jason, Arora Rohit, Chen David, Kloosterman Robert, Cecere Matthew, Gorla Jaswanth, Saleh Richard, Drennan Ian, Teja Bijan, Fehlings Michael, Ronksley Paul, Leung Alexander A, Weisz Dany E, Ware Harriet, Whelan Mairead, Emerson David B, Arora Rahul K, Bobrovitz Niklas

Affiliations

Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, and Centre for Health Informatics, Department of Community Health Sciences, University of Calgary, Calgary, Alberta, Canada (C.C.).

Stripe, San Francisco, California (J.S.).

Publication information

Ann Intern Med. 2025 Mar;178(3):389-401. doi: 10.7326/ANNALS-24-02189. Epub 2025 Feb 25.


DOI:10.7326/ANNALS-24-02189
PMID:39993313
Abstract

BACKGROUND: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.

OBJECTIVE: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.

DESIGN: Diagnostic test accuracy.

SETTING: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).

PARTICIPANTS: None.

MEASUREMENTS: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).

RESULTS: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.

LIMITATIONS: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.

CONCLUSION: A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.

PRIMARY FUNDING SOURCE: None.
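
The performance measures named in the abstract (accuracy, sensitivity, specificity) and the per-citation cost comparison can be illustrated with a short Python sketch. This is not the authors' code. The `ScreeningCounts` class, the function names, and the two sets of counts are hypothetical and chosen only for illustration. The abstract does not say how the "weighted" figures were weighted; the sketch assumes each review is weighted by its number of relevant (or irrelevant) articles, which is equivalent to pooling counts across reviews. Only the cost totals ($1666.67 and $157.02 for 10 000 citations) come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Confusion-matrix counts for one review (hypothetical values only)."""
    tp: int  # included by both the LLM and the original SR authors
    fn: int  # included by the SR authors, excluded by the LLM
    tn: int  # excluded by both
    fp: int  # excluded by the SR authors, included by the LLM


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly relevant articles that the LLM included."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ScreeningCounts) -> float:
    """Share of truly irrelevant articles that the LLM excluded."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ScreeningCounts) -> float:
    """Share of all screening decisions that match the SR authors."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def weighted_sensitivity(reviews: list[ScreeningCounts]) -> float:
    # Assumption: weight each review by its number of relevant articles.
    total = sum(c.tp + c.fn for c in reviews)
    return sum(sensitivity(c) * (c.tp + c.fn) for c in reviews) / total


def weighted_specificity(reviews: list[ScreeningCounts]) -> float:
    # Assumption: weight each review by its number of irrelevant articles.
    total = sum(c.tn + c.fp for c in reviews)
    return sum(specificity(c) * (c.tn + c.fp) for c in reviews) / total


def cost_per_citation(total_cost_usd: float, n_citations: int) -> float:
    return total_cost_usd / n_citations


# Two made-up reviews, not data from the study.
reviews = [
    ScreeningCounts(tp=45, fn=5, tn=800, fp=150),
    ScreeningCounts(tp=30, fn=0, tn=450, fp=50),
]

# Cost totals reported in the abstract for 10 000 citations.
human_cost = cost_per_citation(1666.67, 10_000)
llm_cost = cost_per_citation(157.02, 10_000)

if __name__ == "__main__":
    print(f"weighted sensitivity: {weighted_sensitivity(reviews):.4f}")
    print(f"weighted specificity: {weighted_specificity(reviews):.4f}")
    print(f"human cost per citation: ${human_cost:.4f}")
    print(f"LLM cost per citation:   ${llm_cost:.4f}")
    print(f"cost ratio (human / LLM): {human_cost / llm_cost:.1f}")
```

Under the reported totals, direct human screening costs roughly ten times as much per citation as the LLM-based approach. The sensitivity and specificity values printed here reflect only the made-up counts and say nothing about the study's results.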

