Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People's Hospital, Zhengzhou, China.
Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China.
JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
This study aimed to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability on the Chinese National Medical Licensing Examination (CNMLE).
The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. The passing accuracy threshold was set at 60%. Chi-square (χ²) tests and κ values were used to evaluate the model's accuracy and consistency.
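As an illustration of the repeated-query protocol with subspecialty system roles, here is a minimal sketch using the OpenAI Python client. The model name, the system-prompt wording, and the repetition count are illustrative assumptions; the study's exact prompts and settings are not reproduced here.

```python
# Minimal sketch of the repeated-query protocol described above.
# The model name, system-prompt wording, and N_REPEATS are
# illustrative assumptions, not the study's actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
N_REPEATS = 10     # the study repeated each question 8 to 12 times

def ask_question(question: str, subspecialty: str, model: str = "gpt-4") -> list[str]:
    """Pose one CNMLE multiple-choice question (in Chinese) N_REPEATS
    times, with a system role tailored to a medical subspecialty."""
    answers = []
    for _ in range(N_REPEATS):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": f"You are an experienced physician specializing in "
                            f"{subspecialty}. Answer the multiple-choice question "
                            "with the single best option."},
                {"role": "user", "content": question},
            ],
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers
```

Collecting the repeated answers per question makes it straightforward to score first-response accuracy against the 60% threshold and to measure answer variability across repetitions.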
GPT-4.0 achieved a passing accuracy of 72.7%, significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001), and both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (by 0.3%-3.7%) and GPT-3.5 (by 1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively, although these effects did not reach statistical significance (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). On the first response, GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, whereas GPT-3.5 did so in 7 of 15.
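For concreteness, the two kinds of comparison reported above can be sketched as follows. The correct/incorrect counts are hypothetical placeholders roughly consistent with the reported percentages (not the study's raw data), and Fleiss' κ is one plausible consistency statistic; the abstract does not specify which κ was used.

```python
# Sketch of the accuracy comparison (chi-square test) and the
# repetition-consistency measure (a kappa statistic). All numbers
# below are hypothetical placeholders, not the study's raw data.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 2x2 table of first-response outcomes per model:
# rows = GPT-4.0, GPT-3.5; columns = correct, incorrect
# (placeholder counts close to 72.7% and 54% of 500 questions)
table = np.array([[364, 136],
                  [270, 230]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, P={p:.3g}")

# Consistency across repetitions: one row per question, one column
# per repeated answer (options A-E). Toy data for illustration.
answers = np.array([["A"] * 10,
                    ["B"] * 9 + ["C"],
                    ["D"] * 10])
counts, _ = aggregate_raters(answers)  # per-question option counts
print(f"kappa={fleiss_kappa(counts):.3f}")
```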
GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role enhanced the model's reliability and answer coherence, though not to a statistically significant degree. GPT-4.0 shows promising potential for medical education and clinical practice and merits further study.