Martinelli Canio, Giordano Antonio, Carnevale Vincenzo, Burk Sharon Raffaella, Porto Lavinia, Vizzielli Giuseppe, Ercoli Alfredo
Sbarro Institute for Cancer Research and Molecular Medicine and Center of Biotechnology, College of Science and Technology, Temple University, Philadelphia, PA.
Department of Human Pathology of Adult and Childhood "Gaetano Barresi," Unit of Obstetrics and Gynecology, University of Messina, Messina, Italy.
Mayo Clin Proc Digit Health. 2025 Mar 8;3(2):100206. doi: 10.1016/j.mcpdig.2025.100206. eCollection 2025 Jun.
To systematically evaluate the performance of artificial intelligence (AI) large language models (LLMs) compared with obstetrics-gynecology residents in clinical decision-making, examining diagnostic accuracy and error patterns across linguistic domains, time constraints, and experience levels.
In this cross-sectional study, we evaluated 8 AI LLMs and 24 obstetrics-gynecology residents (Years 1-5) using 60 standardized clinical scenarios. Most AI LLMs and all residents were assessed in May 2024, whereas ChatGPT-01-preview, GPT4o, and Claude Sonnet 3.5 were evaluated in November 2024. The assessment framework incorporated English and Italian scenarios under both timed and untimed conditions, along with systematic error pattern analysis. The primary outcome was diagnostic accuracy; secondary end points included AI system stratification, resident progression, language impact, time pressure effects, and integration potential.
The AI LLMs achieved higher overall accuracy (73.75%; 95% confidence interval [CI], 69.64%-77.49%) than residents (65.35%; 95% CI, 62.85%-67.76%; P<.001). High-performing AI systems (ChatGPT-01-preview, GPT4o, and Claude Sonnet 3.5) achieved consistently high cross-linguistic accuracy (88.33%) with minimal language impact (6.67%±0.00%). Resident performance declined significantly under time constraints (from 73.2% to 56.5% adjusted accuracy; Cohen's d=1.009; P<.001), whereas AI systems showed less deterioration. Error pattern analysis indicated a moderate correlation between AI and human reasoning (r=0.666; P<.001). Residents exhibited systematic progression from year 1 (44.7%) to year 5 (87.1%). Integration analysis found variable benefits across training levels, with maximum enhancement in early-career residents (+29.7%; P<.001).
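The abstract does not say how the confidence intervals for overall accuracy were computed. As an illustrative sketch only, the Python snippet below applies the Wilson score interval to pooled response counts. The counts are assumptions inferred from the study design (8 models × 60 scenarios = 480 AI responses, 24 residents × 60 scenarios = 1,440 resident responses) and from the reported percentages (354 and 941 correct, respectively); neither the counts nor the choice of the Wilson method is stated in the source. Under these assumptions the computed intervals agree with the reported ones to two decimal places.

```python
import math

# 97.5th percentile of the standard normal, giving a two-sided 95% interval.
Z_95 = 1.959963984540054


def wilson_interval(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Return the Wilson score interval (lower, upper) for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: pooled counts are inferred from the design and the reported
# percentages; they are not given in the abstract.
ai_correct, ai_total = 354, 8 * 60      # 354/480 = 73.75%
res_correct, res_total = 941, 24 * 60   # 941/1440 is about 65.35%

ai_ci = wilson_interval(ai_correct, ai_total)
res_ci = wilson_interval(res_correct, res_total)

print(f"AI LLMs:   {ai_correct / ai_total:.2%} (95% CI {ai_ci[0]:.2%}-{ai_ci[1]:.2%})")
print(f"Residents: {res_correct / res_total:.2%} (95% CI {res_ci[0]:.2%}-{res_ci[1]:.2%})")
```

The Wilson interval is used here because it stays within 0%-100% and behaves well for proportions away from 50%; a simple Wald interval would give slightly different bounds.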
High-performing AI LLMs showed strong diagnostic accuracy and resilience under linguistic and temporal pressures. These findings suggest that AI-enhanced decision-making may offer particular benefits in obstetrics and gynecology training programs, especially for junior residents, by improving diagnostic consistency and potentially reducing cognitive load in time-sensitive clinical settings.