
Transforming literature screening: The emerging role of large language models in systematic reviews.

Author information

Delgado-Chaves Fernando M, Jennings Matthew J, Atalaia Antonio, Wolff Justus, Horvath Rita, Mamdouh Zeinab M, Baumbach Jan, Baumbach Linda

Affiliations

Institute for Computational Systems Biology, Faculty of Mathematics, Informatics and Natural Sciences, University of Hamburg, Hamburg 22761, Germany.

Center for Motor Neuron Biology and Diseases, Department of Neurology, Columbia University, New York, NY 10032.

Publication information

Proc Natl Acad Sci U S A. 2025 Jan 14;122(2):e2411962122. doi: 10.1073/pnas.2411962122. Epub 2025 Jan 6.

Abstract

Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve labor-intensive manual article screening. Large language models (LLMs) can select relevant literature, but their quality and efficacy compared to humans are still being determined. We evaluated the overlap between the title- and abstract-based article selections of 18 different LLMs and the human-selected articles for three SRs. In the three SRs, 185/4,662, 122/1,741, and 45/66 articles had been selected and considered for full-text screening by two independent reviewers. Due to technical variations and the inability of the LLMs to classify all records, the LLMs' considered sample sizes were smaller. However, on average, the 18 LLMs correctly classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max 1,574), and 27 (min 22; max 37) of the titles and abstracts as either included or excluded for the three SRs, respectively. Additional analysis revealed that the definitions of the inclusion criteria and the conceptual designs significantly influenced the LLM performances. In conclusion, LLMs can reduce one reviewer's workload by between 33% and 93% during title and abstract screening. However, the exact formulation of the inclusion and exclusion criteria should be refined beforehand for ideal support of the LLMs.
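The screening setup described in the abstract can be made concrete with a short sketch. The snippet below is an illustrative outline only, not the authors' pipeline: it assumes a hypothetical `call_llm` helper standing in for any chat-style LLM API, a prompt wording of my own invention, and a rough workload-reduction proxy (the share of records the LLM pre-excludes, which a single reviewer would then not need to screen).

```python
# Illustrative sketch of LLM-assisted title/abstract screening (not the authors' code).
# `call_llm` is a hypothetical helper; plug in whichever LLM client you use.

from dataclasses import dataclass


@dataclass
class Record:
    title: str
    abstract: str
    human_included: bool  # reference decision from the two independent reviewers


PROMPT_TEMPLATE = (
    "You screen articles for a systematic review.\n"
    "Inclusion criteria: {inclusion}\n"
    "Exclusion criteria: {exclusion}\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n"
    "Answer with exactly one word: INCLUDE or EXCLUDE."
)


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-style LLM API; returns the raw answer."""
    raise NotImplementedError("plug in your LLM client here")


def screen(records, inclusion, exclusion):
    """Classify each record with the LLM; returns booleans (True = include)."""
    decisions = []
    for rec in records:
        answer = call_llm(PROMPT_TEMPLATE.format(
            inclusion=inclusion, exclusion=exclusion,
            title=rec.title, abstract=rec.abstract,
        ))
        decisions.append(answer.strip().upper().startswith("INCLUDE"))
    return decisions


def agreement_and_workload_reduction(records, llm_decisions):
    """Fraction of LLM decisions matching the human reference, and the share of
    records the LLM pre-excludes (a rough proxy for one reviewer's saved work)."""
    correct = sum(d == r.human_included for d, r in zip(llm_decisions, records))
    excluded_by_llm = sum(not d for d in llm_decisions)
    return correct / len(records), excluded_by_llm / len(records)
```

Under these assumptions, the agreement figure corresponds to the "correctly classified as included or excluded" counts reported in the abstract, and the exclusion share corresponds to the 33% to 93% workload-reduction range; how the criteria are phrased in the prompt is exactly the point the authors flag as needing refinement beforehand.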


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/66af/11745399/93ff252dc095/pnas.2411962122fig01.jpg
