Rubinstein Max, Grant Sean, Griffin Beth Ann, Pessar Seema Choksy, Stein Bradley D
RAND, Pittsburgh, Pennsylvania, USA.
University of Oregon, Eugene, Oregon, USA.
Cochrane Evid Synth Methods. 2025 May 22;3(3):e70031. doi: 10.1002/cesm.70031. eCollection 2025 May.
We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included ("false exclusion rate").
We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary test of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.
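The abstract does not report the exact prompt, model settings, or API workflow used for screening. The sketch below is only an illustration of how bibliographic records exported to CSV could be screened with GPT-4 through the OpenAI API; the column names, prompt wording, and eligibility criteria are assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: column names, prompt text, and settings are assumptions,
# not the authors' actual screening pipeline.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREEN_PROMPT = (
    "You are screening articles for a review of quantitative evaluations of "
    "federal and state policies addressing the opioid crisis. Based on the "
    "title, abstract, and keywords below, answer with exactly one word: "
    "EXCLUDE if the article clearly does not meet these criteria, otherwise KEEP.\n\n"
    "Title: {title}\nAbstract: {abstract}\nKeywords: {keywords}"
)

def screen_article(row: pd.Series) -> str:
    """Ask GPT-4 for an exclude/keep recommendation for one bibliographic record."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # near-deterministic output for reproducibility
        messages=[{
            "role": "user",
            "content": SCREEN_PROMPT.format(
                title=row["title"],
                abstract=row["abstract"],
                keywords=row["keywords"],
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper()

# Bibliographic database exported to CSV with titles, abstracts, and keywords.
articles = pd.read_csv("bibliographic_database.csv")
articles["recommendation"] = articles.apply(screen_article, axis=1)
articles.to_csv("screening_recommendations.csv", index=False)
```

In practice one would also batch requests, handle rate limits and records with missing abstracts, and then compare the model's recommendations against human screening decisions on sampled articles, as in the preliminary and final tests described above.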
GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% (417 of the 41,742 articles) were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude as eligible for inclusion; in other words, 65% of the articles GPT-4 recommended for inclusion should have been excluded.
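An interval of [0.00, 0.05] around an observed rate of zero false exclusions is the kind of bound an exact binomial (Clopper-Pearson) interval produces when no events are observed in a modest validation sample. The sketch below is a generic illustration of that calculation, not the authors' reported analysis; the sample size shown is hypothetical, since the abstract does not state how many articles were checked in the final test.

```python
# Illustrative sketch: exact (Clopper-Pearson) 95% CI for a binomial proportion
# when zero false exclusions are observed. The sample size n is hypothetical.
from scipy.stats import beta

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for x successes in n trials."""
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# With 0 false exclusions among roughly 70 manually checked articles (an assumed
# sample size), the upper bound is about 0.05, matching the shape of the
# interval reported above.
print(clopper_pearson(x=0, n=72))  # approximately (0.0, 0.05)
```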
DISCUSSION/CONCLUSIONS: GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine inclusions, and it did not perform well at that task. However, GPT-4 dramatically reduced the number of articles requiring review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.