Yoon Soo-Hyuk, Oh Seok Kyeong, Lim Byung Gun, Lee Ho-Jin
Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea.
Department of Anesthesiology and Pain Medicine, Korea University Guro Hospital, Korea University College of Medicine, Seoul, Republic of Korea.
JMIR Med Educ. 2024 Sep 16;10:e56859. doi: 10.2196/56859.
ChatGPT has been tested in health care, including on the US Medical Licensing Examination and specialty exams, with near-passing results. Its performance in anesthesiology has been assessed using English-language board examination questions; however, its effectiveness in Korea remains unexplored.
This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education.
We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using the in-training examinations administered to Korean anesthesiology residents over the past 5 years, each comprising 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess performance differences across languages, we compared GPT-4's problem-solving proficiency on the original Korean texts with its proficiency on their English translations.
A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated significantly better overall performance than GPT-3.5 (37.2%) and CLOVA X (36.7%), whereas the overall performance of GPT-3.5 and CLOVA X did not differ significantly. Additionally, GPT-4 performed better on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001).
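The Korean versus English comparison is a paired design, because the same questions were answered in both languages. The abstract does not name the statistical method or report the underlying counts, so the following is only a minimal sketch of one common way to analyze such data: a paired difference in proportions with a Wald-type confidence interval, plus the McNemar chi-square statistic. The function names and the discordant-pair counts `b` and `c` are hypothetical and chosen purely for illustration; they are not taken from the study.

```python
import math


def accuracy(correct, total):
    """Performance as defined in the abstract: correct answers / questions."""
    return correct / total


def paired_difference_ci(n, b, c, z=1.96):
    """Wald-type CI for a paired difference in proportions.

    n: number of paired questions
    b: questions answered correctly only in condition A (e.g., English)
    c: questions answered correctly only in condition B (e.g., Korean)
    Returns (difference, lower bound, upper bound) as proportions.
    """
    diff = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se


def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (1 df, no continuity correction)."""
    return (b - c) ** 2 / (b + c)


# Hypothetical discordant counts, NOT the study's data.
n_questions = 398  # number of questions analyzed, from the abstract
b_only_english = 50  # assumed for illustration
c_only_korean = 20   # assumed for illustration

diff, lower, upper = paired_difference_ci(n_questions, b_only_english, c_only_korean)
chi2 = mcnemar_chi2(b_only_english, c_only_korean)
print(f"difference = {diff:.1%}, 95% CI {lower:.1%} to {upper:.1%}, chi2 = {chi2:.2f}")
```

Only the discordant pairs (questions answered correctly in one language but not the other) drive both the difference and the test statistic; questions answered the same way in both languages contribute only through `n`.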
This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.