The PERFORM Study: Artificial Intelligence Versus Human Residents in Cross-Sectional Obstetrics-Gynecology Scenarios Across Languages and Time Constraints.

Authors

Martinelli Canio, Giordano Antonio, Carnevale Vincenzo, Burk Sharon Raffaella, Porto Lavinia, Vizzielli Giuseppe, Ercoli Alfredo

Affiliations

Sbarro Institute for Cancer Research and Molecular Medicine and Center of Biotechnology, College of Science and Technology, Temple University, Philadelphia, PA.

Department of Human Pathology of Adult and Childhood "Gaetano Barresi," Unit of Obstetrics and Gynecology, University of Messina, Messina, Italy.

Publication details

Mayo Clin Proc Digit Health. 2025 Mar 8;3(2):100206. doi: 10.1016/j.mcpdig.2025.100206. eCollection 2025 Jun.

DOI:10.1016/j.mcpdig.2025.100206
PMID:40568610
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12190988/
Abstract

OBJECTIVE

To systematically evaluate the performance of artificial intelligence (AI) large language models (LLMs) compared with obstetrics-gynecology residents in clinical decision-making, examining diagnostic accuracy and error patterns across linguistic domains, time constraints, and experience levels.

PATIENTS AND METHODS

In this cross-sectional study, we evaluated 8 AI LLMs and 24 obstetrics-gynecology residents (years 1-5) using 60 standardized clinical scenarios. Most AI LLMs and all residents were assessed in May 2024, whereas ChatGPT o1-preview, ChatGPT-4o, and Claude Sonnet 3.5 were evaluated in November 2024. The assessment framework incorporated English and Italian scenarios under both timed and untimed conditions, along with systematic error pattern analysis. The primary outcome was diagnostic accuracy; secondary end points included AI system stratification, resident progression, language impact, time pressure effects, and integration potential.

RESULTS

The AI LLMs achieved higher overall accuracy (73.75%; 95% confidence interval [CI], 69.64%-77.49%) than residents (65.35%; 95% CI, 62.85%-67.76%; P<.001). High-performing AI systems (ChatGPT o1-preview, ChatGPT-4o, and Claude Sonnet 3.5) achieved consistently high cross-linguistic accuracy (88.33%) with minimal language impact (6.67%±0.00%). Resident performance declined significantly under time constraints (from 73.2% to 56.5% adjusted accuracy; Cohen's d=1.009; P<.001), whereas AI systems showed less deterioration. Error pattern analysis indicated a moderate correlation between AI and human reasoning (r=0.666; P<.001). Residents exhibited systematic progression from year 1 (44.7%) to year 5 (87.1%). Integration analysis found variable benefits across training levels, with maximum enhancement in early-career residents (+29.7%; P<.001).
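As a rough check on the confidence intervals quoted above, the sketch below computes a Wilson score interval for a binomial proportion. The abstract states neither the interval method nor the response counts, so both are assumptions here: the counts are taken as 8 models × 60 scenarios = 480 AI responses and 24 residents × 60 scenarios = 1,440 resident responses, with correct-answer counts (354 and 941) back-calculated from the reported percentages. Under those assumptions the output lands close to the reported bounds, which suggests (but does not confirm) how the intervals were derived.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (low, high) for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed counts (not given in the abstract):
#   AI:        8 models    x 60 scenarios = 480 responses,  354 correct (73.75%)
#   Residents: 24 residents x 60 scenarios = 1440 responses, 941 correct (about 65.35%)
ai_low, ai_high = wilson_interval(354, 480)
res_low, res_high = wilson_interval(941, 1440)

print(f"AI LLMs:   {ai_low:.2%} to {ai_high:.2%}")
print(f"Residents: {res_low:.2%} to {res_high:.2%}")
```

The Wilson interval is used here rather than the simpler normal-approximation (Wald) interval because it stays inside the 0-100% range and is asymmetric around the point estimate, as the reported intervals are.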

CONCLUSION

High-performing AI LLMs showed strong diagnostic accuracy and resilience under linguistic and temporal pressures. These findings suggest that AI-enhanced decision-making may offer particular benefits in obstetrics and gynecology training programs, especially for junior residents, by improving diagnostic consistency and potentially reducing cognitive load in time-sensitive clinical settings.
