How Well Do ChatGPT and Claude Perform in Study Selection for Systematic Review in Obstetrics.

Authors

Insuk Suppachai, Boonpattharatthiti Kansak, Booncharoen Chimbun, Chaipitak Panitnan, Rashid Muhammed, Veettil Sajesh K, Lai Nai Ming, Chaiyakunapruk Nathorn, Dhippayom Teerapon

Affiliations

Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand.

The Research Unit of Evidence Synthesis (TRUES), Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand.

Publication information

J Med Syst. 2025 Sep 4;49(1):110. doi: 10.1007/s10916-025-02246-4.

DOI: 10.1007/s10916-025-02246-4
PMID: 40906005

Abstract

The use of generative AI in systematic review workflows has gained attention for enhancing study selection efficiency. However, evidence on its screening performance remains inconclusive, and direct comparisons between different generative AI models are still limited. The objective of this study is to evaluate the performance of ChatGPT-4o and Claude 3.5 Sonnet in the study selection process of a systematic review in obstetrics. A literature search was conducted using PubMed, EMBASE, Cochrane CENTRAL, and EBSCO Open Dissertations from inception till February 2024. Titles and abstracts were screened using a structured prompt-based approach, comparing decisions by ChatGPT, Claude and junior researchers with decisions by an experienced researcher serving as the reference standard. For the full-text review, short and long prompt strategies were applied. We reported title/abstract screening and full-text review performances using accuracy, sensitivity (recall), precision, F1-score, and negative predictive value. In the title/abstract screening phase, human researchers demonstrated the highest accuracy (0.9593), followed by Claude (0.9448) and ChatGPT (0.9138). The F1-score was the highest among human researchers (0.3853), followed by Claude (0.3724) and ChatGPT (0.2755). Negative predictive value (NPV) was high across all screeners: ChatGPT (0.9959), Claude (0.9961), and human researchers (0.9924). In the full-text screening phase, ChatGPT with a short prompt achieved the highest accuracy (0.904), highest F1-score (0.90), and NPV of 1.00, surpassing the performance of Claude and human researchers. Generative AI models perform close to human levels in study selection, as evidenced in obstetrics. Further research should explore their integration into evidence synthesis across different fields.
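The five metrics named in the abstract all derive from the same four confusion-matrix counts, taking the experienced researcher's decisions as the reference standard. The sketch below is illustrative only: the `ConfusionCounts` type, the `screening_metrics` function, and the example counts are hypothetical and are not taken from the study. It shows one way such figures could be computed, and why a screener can combine high accuracy and NPV with a low F1-score when only a small share of records is eligible, as in the title/abstract phase reported above.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class ConfusionCounts:
    """Screening decisions compared against a reference standard."""
    tp: int  # included by screener, included by reference
    fp: int  # included by screener, excluded by reference
    tn: int  # excluded by screener, excluded by reference
    fn: int  # excluded by screener, included by reference


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute accuracy, sensitivity, precision, F1-score, and NPV."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        # F1 is the harmonic mean of precision and sensitivity,
        # which simplifies to 2TP / (2TP + FP + FN).
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical counts (assumption, not study data): 1000 records,
# of which only 10 are truly eligible.
example = ConfusionCounts(tp=8, fp=40, tn=950, fn=2)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, accuracy is 0.958 and NPV is about 0.998, yet precision is only about 0.167 and F1 about 0.276, because the 40 false positives outnumber the 8 true positives. Accuracy and NPV are dominated by the large pool of correctly excluded records, so they say little about how well a screener handles the few eligible ones.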


Similar articles

1. How Well Do ChatGPT and Claude Perform in Study Selection for Systematic Review in Obstetrics.
   J Med Syst. 2025 Sep 4;49(1):110. doi: 10.1007/s10916-025-02246-4.
2. Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.
   Eur Radiol Exp. 2025 Aug 19;9(1):79. doi: 10.1186/s41747-025-00591-0.
3. Information from digital and human sources: A comparison of chatbot and clinician responses to orthodontic questions.
   Am J Orthod Dentofacial Orthop. 2025 May 6. doi: 10.1016/j.ajodo.2025.04.008.
4. Large language models for the screening step in systematic reviews in dentistry.
   J Dent. 2025 Sep;160:105877. doi: 10.1016/j.jdent.2025.105877. Epub 2025 Jun 4.
5. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.
   J Med Internet Res. 2024 Jan 12;26:e48996. doi: 10.2196/48996.
6. Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study.
   J Med Internet Res. 2024 Apr 22;26:e54419. doi: 10.2196/54419.
7. AI in Medical Questionnaires: Innovations, Diagnosis, and Implications.
   J Med Internet Res. 2025 Jun 23;27:e72398. doi: 10.2196/72398.
8. Classifying Patient Complaints Using Artificial Intelligence-Powered Large Language Models: Cross-Sectional Study.
   J Med Internet Res. 2025 Aug 6;27:e74231. doi: 10.2196/74231.
9. Evaluating the Performance of State-of-the-Art Artificial Intelligence Chatbots Based on the WHO Global Guidelines for the Prevention of Surgical Site Infection: Cross-Sectional Study.
   J Med Internet Res. 2025 Jul 31;27:e75567. doi: 10.2196/75567.
10. Evaluating a Customized Version of ChatGPT for Systematic Review Data Extraction in Health Research: Development and Usability Study.
    JMIR Form Res. 2025 Aug 11;9:e68666. doi: 10.2196/68666.
