High-performance automated abstract screening with large language model ensembles.

Authors

Sanghera Rohan, Thirunavukarasu Arun James, El Khoury Marc, O'Logbon Jessica, Chen Yuqing, Watt Archie, Mahmood Mustafa, Butt Hamid, Nishimura George, Soltan Andrew A S

Affiliations

Oxford University Hospitals NHS Foundation Trust, Oxford OX3 9DU, United Kingdom.

Oxford University Clinical Academic Graduate School, Medical Sciences Division, University of Oxford, Oxford OX3 9DU, United Kingdom.

Publication information

J Am Med Inform Assoc. 2025 May 1;32(5):893-904. doi: 10.1093/jamia/ocaf050.

DOI:10.1093/jamia/ocaf050

PMID:40119675

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12012331/

Abstract

OBJECTIVE

Abstract screening is a labor-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.

MATERIALS AND METHODS

LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).

RESULTS

On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLM maximum = 1.000, human maximum = 0.775), precision (LLM maximum = 0.927, human maximum = 0.911), and balanced accuracy (LLM maximum = 0.904, human maximum = 0.865). When evaluated on the comprehensive dataset, the best performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458 on the development dataset, decreasing to 0.1450 on the comprehensive dataset, while conferring workload reductions ranging between 37.55% and 99.11%.
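The fall in precision under class imbalance follows from the metric definitions alone. The sketch below is illustrative only: the class counts and the 0.95/0.80 operating point are hypothetical values chosen for the example, not figures from the study, and `Confusion` and `expected_confusion` are names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for include (positive) / exclude (negative) decisions."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def sensitivity(self) -> float:
        # Share of truly eligible abstracts that were included.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Share of truly ineligible abstracts that were excluded.
        return self.tn / (self.tn + self.fp)

    @property
    def precision(self) -> float:
        # Share of included abstracts that were truly eligible.
        return self.tp / (self.tp + self.fp)

    @property
    def balanced_accuracy(self) -> float:
        # Mean of sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


def expected_confusion(n_pos: int, n_neg: int,
                       sensitivity: float, specificity: float) -> Confusion:
    """Counts produced by a screener with fixed sensitivity and specificity."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return Confusion(tp=tp, fp=n_neg - tn, tn=tn, fn=n_pos - tp)


# Hypothetical screener (sensitivity 0.95, specificity 0.80) on two datasets.
balanced = expected_confusion(400, 400, 0.95, 0.80)       # 50% eligible
imbalanced = expected_confusion(100, 19900, 0.95, 0.80)   # 0.5% eligible

for name, cm in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(f"{name}: sensitivity={cm.sensitivity:.3f} "
          f"precision={cm.precision:.3f} "
          f"balanced_accuracy={cm.balanced_accuracy:.3f}")
```

Sensitivity and balanced accuracy are unchanged between the two datasets, because neither depends on prevalence. Precision collapses on the imbalanced one because false positives scale with the much larger pool of ineligible records.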

DISCUSSION

Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.
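One way an ensemble can reach higher sensitivity than its members is sketched below. The abstract does not state how votes were combined or how workload reduction was computed, so both are assumptions here: an "include if any member includes" rule, and workload reduction taken as the fraction of records excluded without further review. All function names and the toy votes are hypothetical.

```python
from typing import Sequence


def ensemble_any(votes: Sequence[Sequence[bool]]) -> list[bool]:
    """Include a record if ANY ensemble member votes to include it (assumed OR rule)."""
    return [any(record_votes) for record_votes in zip(*votes)]


def sensitivity(truth: Sequence[bool], decisions: Sequence[bool]) -> float:
    """Share of truly eligible records that were included."""
    included = sum(1 for t, d in zip(truth, decisions) if t and d)
    return included / sum(truth)


def workload_reduction(decisions: Sequence[bool]) -> float:
    """Assumed definition: fraction of records excluded, so not needing manual review."""
    return 1 - sum(decisions) / len(decisions)


# Toy example: 8 records, 2 of them truly eligible.
truth = [True, False, False, True, False, False, False, False]
llm_a = [True, False, True, False, False, False, False, False]   # misses record 4
llm_b = [False, False, False, True, False, True, False, False]   # misses record 1

ensemble = ensemble_any([llm_a, llm_b])

print("LLM A sensitivity:   ", sensitivity(truth, llm_a))
print("LLM B sensitivity:   ", sensitivity(truth, llm_b))
print("Ensemble sensitivity:", sensitivity(truth, ensemble))
print("Workload reduction:  ", workload_reduction(ensemble))
```

Under an OR rule a record is missed only if every member misses it, so sensitivity can only rise. The cost is that false positives accumulate across members, which lowers precision and shrinks the workload reduction.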

CONCLUSION

LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5eeb/12012331/495321d03c0a/ocaf050f1.jpg
