Nykvist Björn, Macura Biljana, Xylia Maria, Olsson Erik
Stockholm Environment Institute, 115 23, Stockholm, Sweden.
Environmental and Energy Systems Studies, Lund University, 221 00, Lund, Sweden.
Environ Evid. 2025 Apr 23;14(1):7. doi: 10.1186/s13750-025-00360-x.
In this paper we show that OpenAI's Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand, comprising almost 12,000 records, using the same eligibility criteria as human screeners. We tested 3 different versions of this model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall remains above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
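A minimal sketch of the evaluation logic described in the abstract (not the authors' code): given human eligibility labels and model relevance probabilities for each record, it computes recall and the share of manual screening avoided at a given cutoff. The function name `evaluate_cutoff` and the example numbers are hypothetical illustrations, not the study's data.

```python
def evaluate_cutoff(probs, labels, cutoff):
    """probs: model relevance probabilities in [0, 1];
    labels: True if human screeners judged the record relevant;
    cutoff: records with prob < cutoff are auto-excluded (not read manually)."""
    included = [p >= cutoff for p in probs]
    relevant = sum(labels)
    # Recall: fraction of human-relevant records the model keeps for manual review.
    recall = sum(1 for inc, rel in zip(included, labels) if inc and rel) / relevant
    # Work saved: fraction of all records the screeners no longer need to read.
    work_saved = 1 - sum(included) / len(probs)
    return recall, work_saved

# Hypothetical illustration with made-up numbers, not the study's dataset:
probs = [0.9, 0.7, 0.4, 0.2, 0.1, 0.8, 0.05, 0.6]
labels = [True, True, False, False, False, True, False, False]
for cutoff in (0.5, 0.7):
    r, s = evaluate_cutoff(probs, labels, cutoff)
    print(f"cutoff={cutoff}: recall={r:.0%}, manual screening saved={s:.0%}")
```

Raising the cutoff excludes more records automatically and saves more screening time, at the risk of lowering recall; this is the same trade-off the abstract reports between the 0.5 cutoff (100% recall, 50% saved) and a higher threshold (above 95% recall, 75% saved).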