

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.

Affiliations

Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.

Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.

Publication Information

J Med Internet Res. 2024 Jan 12;26:e48996. doi: 10.2196/48996.

Abstract

BACKGROUND

The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources.

OBJECTIVE

This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets, and to compare their performance against ground truth labeling by 2 independent human reviewers.

METHODS

We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts.
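The screening step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released script: the prompt wording, the INCLUDE/EXCLUDE response convention, and the model name are assumptions; only the general pattern (natural-language criteria plus one title/abstract per API call) comes from the abstract.

```python
def build_messages(criteria: str, title: str, abstract: str) -> list:
    """Assemble a chat prompt asking the model to screen one paper
    against natural-language inclusion criteria (wording is assumed)."""
    return [
        {"role": "system",
         "content": "You screen papers for a systematic review. "
                    "Answer with exactly INCLUDE or EXCLUDE."},
        {"role": "user",
         "content": f"Criteria: {criteria}\n\n"
                    f"Title: {title}\n\nAbstract: {abstract}"},
    ]

def parse_decision(reply: str) -> bool:
    """Map the model's free-text reply to a boolean include label."""
    return reply.strip().upper().startswith("INCLUDE")

# The actual API call would look like this (requires the `openai`
# package and an API key, so it is left commented out here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=build_messages(criteria, title, abstract),
# )
# include = parse_decision(resp.choices[0].message.content)
```

Keeping prompt construction and reply parsing as pure functions makes the workflow easy to test offline and to rerun over a corpus of thousands of records.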

RESULTS

Our results show an accuracy of 0.91, a macro F-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications.
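The reported metrics can all be derived from a binary confusion matrix over include/exclude decisions. A minimal sketch, using illustrative counts rather than the study's data; the prevalence- and bias-adjusted kappa (PABAK) reduces to 2·(observed agreement) − 1 for two raters and two categories:

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute screening metrics from a binary confusion matrix,
    where the positive class is "include" (tp = both say include,
    tn = both say exclude)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sens_included = tp / (tp + fn)   # recall on included papers
    sens_excluded = tn / (tn + fp)   # recall on excluded papers
    prec_inc = tp / (tp + fp)
    prec_exc = tn / (tn + fn)
    f1_inc = 2 * prec_inc * sens_included / (prec_inc + sens_included)
    f1_exc = 2 * prec_exc * sens_excluded / (prec_exc + sens_excluded)
    macro_f = (f1_inc + f1_exc) / 2  # unweighted mean of per-class F1
    pabak = 2 * accuracy - 1         # prevalence/bias-adjusted kappa
    return {"accuracy": accuracy, "macro_f": macro_f,
            "sens_included": sens_included,
            "sens_excluded": sens_excluded, "pabak": pabak}
```

The macro F-score weights both classes equally, which is why it can sit well below accuracy on screening corpora where excluded papers vastly outnumber included ones.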

CONCLUSIONS

Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b37/10818236/9e2f72409d22/jmir_v26i1e48996_fig1.jpg
