Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis.

Authors

Nykvist Björn, Macura Biljana, Xylia Maria, Olsson Erik

Affiliations

Stockholm Environment Institute, 115 23, Stockholm, Sweden.

Environmental and Energy Systems Studies, Lund University, 221 00, Lund, Sweden.

Publication details

Environ Evid. 2025 Apr 23;14(1):7. doi: 10.1186/s13750-025-00360-x.


DOI:10.1186/s13750-025-00360-x
PMID:40270055
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12016299/
Abstract

In this paper we show that OpenAI's large language model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review of electric vehicle charging infrastructure demand, comprising almost 12,000 records, using the same eligibility criteria as the human screeners. We tested three versions of the model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) at a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. A higher cutoff threshold can save more time. With the threshold chosen so that recall for GPT-4 remains above 95% (so that up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study evaluated performance on only one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
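
The cutoff logic described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `screen_at_cutoff` and `highest_cutoff_for_recall` are hypothetical, the ten scores and labels are made-up toy values rather than data from the study, and time saved is approximated as the share of records the model excludes, which presumes every record takes equally long to screen by hand. It is not the authors' code.

```python
from typing import List, Tuple


def screen_at_cutoff(
    probs: List[float], labels: List[int], cutoff: float
) -> Tuple[float, float]:
    """Return (recall, fraction_saved) for a given probability cutoff.

    Records scoring at or above the cutoff are passed on to human
    screeners; the rest are excluded automatically.
    """
    kept = [p >= cutoff for p in probs]
    relevant_total = sum(labels)
    relevant_kept = sum(1 for k, y in zip(kept, labels) if k and y == 1)
    # With no relevant records, nothing can be missed.
    recall = relevant_kept / relevant_total if relevant_total else 1.0
    # Assumption: manual effort scales with the number of records kept.
    fraction_saved = 1 - sum(kept) / len(probs)
    return recall, fraction_saved


def highest_cutoff_for_recall(
    probs: List[float], labels: List[int], min_recall: float
) -> float:
    """Return the largest observed score usable as a cutoff while
    recall stays at or above min_recall (0.0 if none qualifies)."""
    best = 0.0
    # Recall can only fall as the cutoff rises, so the last candidate
    # that still meets the target is the highest admissible cutoff.
    for candidate in sorted(set(probs)):
        recall, _ = screen_at_cutoff(probs, labels, candidate)
        if recall >= min_recall:
            best = candidate
    return best


# Toy data (hypothetical): model scores and human labels (1 = relevant).
probs = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

recall, saved = screen_at_cutoff(probs, labels, 0.5)
print(f"cutoff 0.50: recall={recall:.2f}, saved={saved:.2f}")

best = highest_cutoff_for_recall(probs, labels, 0.95)
recall, saved = screen_at_cutoff(probs, labels, best)
print(f"cutoff {best:.2f}: recall={recall:.2f}, saved={saved:.2f}")
```

On the toy data, raising the cutoff from 0.5 to the highest value that keeps all relevant records excludes one more irrelevant record, mirroring the trade-off between recall and time saved that the abstract reports.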

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e64d/12016299/72c5f87e9e89/13750_2025_360_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e64d/12016299/d1df932a8160/13750_2025_360_Fig1_HTML.jpg