Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. Across versions and testing environments, ChatGPT's performance on these examinations has varied remarkably, and a comprehensive understanding of this variability is still lacking.
In this study, we reviewed all studies on ChatGPT's performance in medical licensing examinations published up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by comprehensively analyzing ChatGPT's performance across examination settings. The insights gained from this systematic review will help educators, policymakers, and technical experts use AI in medical education effectively and judiciously.
We searched Web of Science, PubMed, and Scopus with predefined query strings for literature published between January 1, 2022, and March 29, 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses.
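The abstract does not specify the pooling model used in the quantitative analysis; a common choice for pooling accuracy rates across studies is a DerSimonian-Laird random-effects model on logit-transformed proportions. A minimal sketch under that assumption (the study counts below are illustrative placeholders, not data from this review):

```python
# Sketch: DerSimonian-Laird random-effects pooling of accuracy rates on the
# logit scale. Study data are illustrative placeholders, not the review's data.
import math

studies = [(180, 220), (95, 140), (310, 400)]  # (correct answers, total questions)

ys, vs = [], []
for k, n in studies:
    p = k / n
    ys.append(math.log(p / (1 - p)))   # logit-transformed accuracy
    vs.append(1 / (n * p * (1 - p)))   # within-study variance on the logit scale

# Fixed-effect weights and Cochran's Q statistic for heterogeneity
w = [1 / v for v in vs]
ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, ys))

# DerSimonian-Laird estimate of between-study variance (tau^2)
tau2 = max(0.0, (q - (len(studies) - 1)) /
           (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

# Random-effects pooled estimate and 95% CI, back-transformed to a proportion
ws = [1 / (v + tau2) for v in vs]
mu = sum(wi * yi for wi, yi in zip(ws, ys)) / sum(ws)
se = math.sqrt(1 / sum(ws))
inv = lambda x: 1 / (1 + math.exp(-x))  # inverse logit
print(f"pooled accuracy {inv(mu):.2f} "
      f"(95% CI {inv(mu - 1.96 * se):.2f}-{inv(mu + 1.96 * se):.2f})")
```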
A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases and outperformed the average scores of medical students in 13 of 17 cases. Translating examination questions into English improved GPT-3.5's performance but did not affect GPT-4's. GPT-3.5 showed no performance difference between examinations from English-speaking and non-English-speaking countries (P=.72), whereas GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Prompts of any type significantly improved the performance of GPT-3.5 (P=.03) and GPT-4 (P<.01). GPT-3.5 performed better on short-text questions than on long-text questions, and question difficulty affected the performance of both GPT-3.5 and GPT-4. On image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs.
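The abstract does not state how the subgroup P values (eg, P=.72 and P=.02 for English-speaking vs non-English-speaking examinations) were derived; one conventional approach is a two-sided z-test between the pooled logit estimates of the two subgroups. A minimal sketch under that assumption, with illustrative (not reported) inputs:

```python
# Sketch: two-sided z-test between two pooled logit estimates (eg,
# English-speaking vs non-English-speaking examinations). mu/se pairs would
# come from pooling each subgroup as above; values here are illustrative.
import math

mu1, se1 = 1.45, 0.10  # pooled logit accuracy, subgroup 1 (assumed)
mu2, se2 = 1.10, 0.11  # pooled logit accuracy, subgroup 2 (assumed)

z = (mu1 - mu2) / math.sqrt(se1 ** 2 + se2 ** 2)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided P value
print(f"z = {z:.2f}, P = {p:.3f}")
```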
GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education.
PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.