
Evaluating the effectiveness of large language models in abstract screening: a comparative analysis.

Affiliations

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.

Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA.

Publication information

Syst Rev. 2024 Aug 21;13(1):219. doi: 10.1186/s13643-024-02609-x.

Abstract

OBJECTIVE

This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.

METHODS

We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as more recent tools including ChatGPT v4.0 Turbo, ChatGPT v3.5 Turbo, Google Gemini 1.0 Pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared against human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.
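The evaluation described above reduces to comparing each model's include/exclude decision against the human-curated label and tabulating a confusion matrix. A minimal sketch of that scoring step is shown below; the function name and the toy labels are illustrative assumptions, not the authors' actual scripts.

```python
# Illustrative sketch (not the authors' code): score an LLM's abstract-screening
# decisions against human-curated gold-standard inclusion decisions.
def screening_metrics(predicted, gold):
    """Return (sensitivity, specificity, accuracy).

    predicted, gold: equal-length lists of booleans (True = include abstract).
    """
    tp = sum(p and g for p, g in zip(predicted, gold))          # true positives
    tn = sum(not p and not g for p, g in zip(predicted, gold))  # true negatives
    fp = sum(p and not g for p, g in zip(predicted, gold))      # false positives
    fn = sum(not p and g for p, g in zip(predicted, gold))      # false negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on included abstracts
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on excluded abstracts
    accuracy = (tp + tn) / len(gold)                    # overall agreement
    return sensitivity, specificity, accuracy

# Hypothetical example: six abstracts, LLM decisions vs. human decisions.
pred = [True, True, False, False, True, False]
gold = [True, False, False, False, True, True]
sens, spec, acc = screening_metrics(pred, gold)
```

In practice one such comparison would be run per LLM tool and per benchmark database, which is what allows the balanced sensitivity/specificity claim in the Results to be checked rather than relying on accuracy alone.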

RESULTS

Different LLM tools showed varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity and overall accuracy consistently reaching or exceeding 90%, indicating high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.

CONCLUSION

While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.

