Rubinstein Max, Grant Sean, Griffin Beth Ann, Pessar Seema Choksy, Stein Bradley D
RAND, Pittsburgh, Pennsylvania, USA.
University of Oregon, Eugene, Oregon, USA.
Cochrane Evid Synth Methods. 2025 May 22;3(3):e70031. doi: 10.1002/cesm.70031. eCollection 2025 May.
We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included ("false exclusion rate").
We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary test of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.
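The abstract does not report the exact prompt, model settings, or API workflow used for screening. The sketch below is only an illustration of how bibliographic records exported to CSV could be screened with GPT-4 through the OpenAI API; the column names, prompt wording, and eligibility criteria are assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: column names, prompt text, and settings are assumptions,
# not the authors' actual screening pipeline.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREEN_PROMPT = (
    "You are screening articles for a review of quantitative evaluations of "
    "federal and state policies addressing the opioid crisis. Based on the "
    "title, abstract, and keywords below, answer with exactly one word: "
    "EXCLUDE if the article clearly does not meet these criteria, otherwise KEEP.\n\n"
    "Title: {title}\nAbstract: {abstract}\nKeywords: {keywords}"
)

def screen_article(row: pd.Series) -> str:
    """Ask GPT-4 for an exclude/keep recommendation for one bibliographic record."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # near-deterministic output for reproducibility
        messages=[{
            "role": "user",
            "content": SCREEN_PROMPT.format(
                title=row["title"],
                abstract=row["abstract"],
                keywords=row["keywords"],
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper()

# Bibliographic database exported to CSV with titles, abstracts, and keywords.
articles = pd.read_csv("bibliographic_database.csv")
articles["recommendation"] = articles.apply(screen_article, axis=1)
articles.to_csv("screening_recommendations.csv", index=False)
```

In practice one would also batch requests, handle rate limits and records with missing abstracts, and then compare the model's recommendations against human screening decisions on sampled articles, as in the preliminary and final tests described above.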
GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% (417 of the 41,742 articles) were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude as eligible for inclusion; in other words, 65% of the articles GPT-4 recommended for inclusion should have been excluded.
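An interval of [0.00, 0.05] around an observed rate of zero false exclusions is the kind of bound an exact binomial (Clopper-Pearson) interval produces when no events are observed in a modest validation sample. The sketch below is a generic illustration of that calculation, not the authors' reported analysis; the sample size shown is hypothetical, since the abstract does not state how many articles were checked in the final test.

```python
# Illustrative sketch: exact (Clopper-Pearson) 95% CI for a binomial proportion
# when zero false exclusions are observed. The sample size n is hypothetical.
from scipy.stats import beta

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for x successes in n trials."""
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

# With 0 false exclusions among roughly 70 manually checked articles (an assumed
# sample size), the upper bound is about 0.05, matching the shape of the
# interval reported above.
print(clopper_pearson(x=0, n=72))  # approximately (0.0, 0.05)
```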
DISCUSSION/CONCLUSIONS: GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine inclusions, and it did not perform well at that task. However, GPT-4 dramatically reduced the number of articles requiring review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.