Lin Shih-Yi, Hsu Ying-Yu, Ju Shu-Woei, Yeh Pei-Chun, Hsu Wu-Huei, Kao Chia-Hung
Graduate Institute of Biomedical Sciences, College of Medicine, China Medical University, Taichung, Taiwan.
Division of Nephrology and Kidney Institute, China Medical University Hospital, Taichung, Taiwan.
Digit Health. 2024 Oct 18;10:20552076241291404. doi: 10.1177/20552076241291404. eCollection 2024 Jan-Dec.
The aim of this study was to evaluate the ability of generative artificial intelligence (AI) models to handle specialized medical knowledge and solve problems in a formal examination context.
This study used internal medicine examination questions released by the Taiwan Internal Medicine Society from 2020 to 2023 to test three AI models: GPT-4o, Claude_3.5 Sonnet, and Gemini Advanced. Queries rejected by Gemini Advanced were translated into French and resubmitted. Performance was assessed with IBM SPSS Statistics 26: accuracy percentages were calculated, and statistical analyses such as Pearson correlation and analysis of variance (ANOVA) were performed to gauge each model's efficacy.
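For illustration only (the study itself used IBM SPSS Statistics 26), the accuracy tabulation, Pearson correlation, and one-way ANOVA described above could be sketched in Python with scipy; the per-year accuracy values below are hypothetical placeholders, not the study's data.

    # Illustrative sketch only: the study used IBM SPSS Statistics 26.
    # The per-model accuracy vectors below are hypothetical placeholders,
    # not data from the paper.
    from scipy import stats

    # Hypothetical per-year accuracy (%) for each model, 2020-2023
    scores = {
        "GPT-4o":            [80.0, 82.0, 86.0, 79.0],
        "Claude_3.5 Sonnet": [83.0, 88.0, 88.0, 80.0],
        "Gemini Advanced":   [70.0, 71.0, 69.0, 68.0],
    }

    # One-way ANOVA: do mean accuracies differ across the three models?
    f_stat, p_anova = stats.f_oneway(*scores.values())
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

    # Pearson correlation between two models' year-by-year accuracies
    r, p_corr = stats.pearsonr(scores["GPT-4o"], scores["Claude_3.5 Sonnet"])
    print(f"Pearson r = {r:.2f}, p = {p_corr:.4f}")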
GPT-4o's best annual score was 86.25% in 2022, with an average of 81.97%. Claude_3.5 Sonnet peaked at 88.13% in both 2021 and 2022, averaging 84.85%, while Gemini Advanced lagged behind with an average of 69.84%. Across specialties, Claude_3.5 Sonnet scored highest in Psychiatry (100%) and Nephrology (97.26%), and GPT-4o performed similarly well in Hematology & Oncology (97.10%) and Nephrology (94.52%). Gemini Advanced's best scores were in Psychiatry (86.96%) and Hematology & Oncology (82.76%), but it struggled with Neurology, scoring below 60%. All models performed better on text-based questions than on image-based ones, although the differences were not statistically significant. On COVID-19-related questions, Claude_3.5 Sonnet scored highest at 89.29%, followed by GPT-4o at 75.00% and Gemini Advanced at 67.86%.
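As a minimal sketch, the text-versus-image comparison reported above could be framed as a chi-square test of independence on a 2x2 contingency table of correct/incorrect counts; the tallies below are hypothetical, not drawn from the paper.

    # Hedged sketch: comparing text-based vs image-based accuracy for one
    # model with a chi-square test of independence. Counts are hypothetical,
    # not taken from the paper.
    from scipy.stats import chi2_contingency

    #               correct  incorrect
    text_based  = [  520,      80   ]   # hypothetical text-question tallies
    image_based = [   70,      15   ]   # hypothetical image-question tallies

    chi2, p, dof, expected = chi2_contingency([text_based, image_based])
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
    # A p-value above 0.05 would be consistent with the reported lack of a
    # significant text-vs-image difference.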
The AI models showed varied proficiency across medical specialties and question types. GPT-4o achieved a higher accuracy rate on image-based questions, while Claude_3.5 Sonnet consistently outperformed the other models overall, highlighting AI's significant potential for assisting medical education.