Transforming literature screening: The emerging role of large language models in systematic reviews.

Authors

Delgado-Chaves Fernando M, Jennings Matthew J, Atalaia Antonio, Wolff Justus, Horvath Rita, Mamdouh Zeinab M, Baumbach Jan, Baumbach Linda

Affiliations

Institute for Computational Systems Biology, Faculty of Mathematics, Informatics and Natural Sciences, University of Hamburg, Hamburg 22761, Germany.

Center for Motor Neuron Biology and Diseases, Department of Neurology, Columbia University, New York, NY 10032.

Publication information

Proc Natl Acad Sci U S A. 2025 Jan 14;122(2):e2411962122. doi: 10.1073/pnas.2411962122. Epub 2025 Jan 6.

DOI: 10.1073/pnas.2411962122

PMID: 39761403

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11745399/

Abstract

Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve labor-intensive manual article screening. Large language models (LLMs) can select relevant literature, but how their quality and efficacy compare with those of human reviewers is still being determined. We evaluated the overlap between the articles that 18 different LLMs selected on the basis of titles and abstracts and the articles selected by humans for three SRs. In the three SRs, 185/4,662, 122/1,741, and 45/66 articles were selected and considered for full-text screening by two independent reviewers. Due to technical variations and the inability of the LLMs to classify all records, the sample sizes considered by the LLMs were smaller. However, on average, the 18 LLMs classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max 1,574), and 27 (min 22; max 37) of the titles and abstracts correctly as either included or excluded for the three SRs, respectively. Additional analysis revealed that the definitions of the inclusion criteria and the conceptual designs significantly influenced LLM performance. In conclusion, LLMs can reduce one reviewer's workload by between 33% and 93% during title and abstract screening. However, the exact formulation of the inclusion and exclusion criteria should be refined beforehand for ideal support of the LLMs.
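To put the counts in the abstract on a common scale, the following minimal sketch converts them to percentages. It is an illustration only, not the authors' analysis: the labels `SR1` to `SR3` and the helper names are hypothetical, and each count is divided by the total number of records in the review. Because the abstract notes that the LLMs classified fewer records than the humans screened, these values are best read as the share of all records classified correctly, which is a lower bound on accuracy among the records an LLM actually classified. They are not the 33% to 93% workload-reduction figures, whose derivation the abstract does not give.

```python
# Counts copied from the abstract. "llm_correct" holds (min, mean, max) over
# the 18 LLMs. The SR1..SR3 labels are hypothetical names for the three reviews.
REVIEWS = {
    "SR1": {"included": 185, "total": 4662, "llm_correct": (4130, 4294, 4329)},
    "SR2": {"included": 122, "total": 1741, "llm_correct": (1449, 1539, 1574)},
    "SR3": {"included": 45, "total": 66, "llm_correct": (22, 27, 37)},
}


def pct(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


def summarize(review):
    """Express one review's counts as percentages of its total record count.

    Assumption: the denominator is the number of records screened by the
    human reviewers, since the abstract does not report per-LLM sample sizes.
    """
    lowest, mean, highest = review["llm_correct"]
    total = review["total"]
    return {
        "human_inclusion_pct": pct(review["included"], total),
        "correct_pct_min": pct(lowest, total),
        "correct_pct_mean": pct(mean, total),
        "correct_pct_max": pct(highest, total),
    }


if __name__ == "__main__":
    for name, review in REVIEWS.items():
        print(name, summarize(review))
```

The third review stands out in this view: roughly two thirds of its records were included by the human reviewers, compared with well under a tenth in the other two, which makes it a very different screening task.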

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/66af/11745399/93ff252dc095/pnas.2411962122fig01.jpg
