UAB Heersink School of Medicine, 1670 University Blvd, Birmingham, AL 35233, United States.
University of South Alabama Whiddon College of Medicine, Mobile, AL, United States.
JMIR Med Educ. 2024 Nov 6;10:e63430. doi: 10.2196/63430.
Recent studies, including those by the National Board of Medical Examiners, have highlighted the capability of large language models (LLMs) such as ChatGPT to pass the United States Medical Licensing Examination (USMLE). However, detailed analyses of LLM performance in specific medical content areas are lacking, limiting assessment of their potential utility in medical education.
This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management.
This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performance.
GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3).
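The pairwise model comparisons reported above can be checked with a standard two-proportion z-test. The sketch below is illustrative rather than the study's actual analysis: the correct-answer counts are inferred from the reported percentages (90.4% and 81.1% of 750 items, i.e., roughly 678 and 608 correct).

```python
from statistics import NormalDist


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test (normal approximation).

    Returns the z statistic and two-sided p-value for the null
    hypothesis that both samples share the same underlying accuracy.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts inferred from the abstract's percentages, not taken from the paper:
# GPT-4o ~678/750 (90.4%) vs GPT-4 ~608/750 (81.1%).
z, p = two_proportion_z(678, 750, 608, 750)
print(f"z = {z:.2f}, p = {p:.2g}")  # difference is far beyond conventional significance
```

With these inferred counts the difference between GPT-4o and GPT-4 is highly significant, consistent with the abstract's claim that GPT-4o "significantly" outperformed its predecessors; the same test applied against the medical student mean of 59.3% yields an even larger z statistic.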
GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.