How Well Do ChatGPT and Claude Perform in Study Selection for Systematic Review in Obstetrics?

Author information

Insuk Suppachai, Boonpattharatthiti Kansak, Booncharoen Chimbun, Chaipitak Panitnan, Rashid Muhammed, Veettil Sajesh K, Lai Nai Ming, Chaiyakunapruk Nathorn, Dhippayom Teerapon

Affiliations

Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand.

The Research Unit of Evidence Synthesis (TRUES), Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand.

Publication information

J Med Syst. 2025 Sep 4;49(1):110. doi: 10.1007/s10916-025-02246-4.

DOI:10.1007/s10916-025-02246-4

PMID:40906005

Abstract

The use of generative AI in systematic review workflows has gained attention for enhancing study selection efficiency. However, evidence on its screening performance remains inconclusive, and direct comparisons between different generative AI models are still limited. The objective of this study is to evaluate the performance of ChatGPT-4o and Claude 3.5 Sonnet in the study selection process of a systematic review in obstetrics. A literature search was conducted using PubMed, EMBASE, Cochrane CENTRAL, and EBSCO Open Dissertations from inception till February 2024. Titles and abstracts were screened using a structured prompt-based approach, comparing decisions by ChatGPT, Claude and junior researchers with decisions by an experienced researcher serving as the reference standard. For the full-text review, short and long prompt strategies were applied. We reported title/abstract screening and full-text review performances using accuracy, sensitivity (recall), precision, F1-score, and negative predictive value. In the title/abstract screening phase, human researchers demonstrated the highest accuracy (0.9593), followed by Claude (0.9448) and ChatGPT (0.9138). The F1-score was the highest among human researchers (0.3853), followed by Claude (0.3724) and ChatGPT (0.2755). Negative predictive value (NPV) was high across all screeners: ChatGPT (0.9959), Claude (0.9961), and human researchers (0.9924). In the full-text screening phase, ChatGPT with a short prompt achieved the highest accuracy (0.904), highest F1-score (0.90), and NPV of 1.00, surpassing the performance of Claude and human researchers. Generative AI models perform close to human levels in study selection, as evidenced in obstetrics. Further research should explore their integration into evidence synthesis across different fields.
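For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how accuracy, sensitivity, precision, F1-score, and NPV are conventionally derived from a 2x2 table of screening decisions against a reference standard. It is not the authors' code: the names `ConfusionCounts` and `screening_metrics` and the counts in the example are hypothetical, chosen only to show how a heavily imbalanced screening set (few truly eligible records) can yield high accuracy and NPV alongside a low F1-score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 table of screener decisions versus the reference standard."""
    tp: int  # included by screener, included by reference
    fp: int  # included by screener, excluded by reference
    fn: int  # excluded by screener, included by reference
    tn: int  # excluded by screener, excluded by reference


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the five metrics named in the abstract from a 2x2 table."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # also called recall
        "precision": _ratio(c.tp, c.tp + c.fp),
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


if __name__ == "__main__":
    # Hypothetical counts (NOT from the study): 1000 records, 15 truly eligible.
    example = ConfusionCounts(tp=10, fp=30, fn=5, tn=955)
    for name, value in screening_metrics(example).items():
        print(f"{name}: {value:.4f}")
```

With these made-up counts, most records are true exclusions, so accuracy and NPV are dominated by the large `tn` cell, while F1 depends only on `tp`, `fp`, and `fn` and stays low. The same pattern is consistent with, though not proof of, the gap between accuracy and F1-score reported for title/abstract screening.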
