Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea.
Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea.
Medicine (Baltimore). 2024 Mar 1;103(9):e37325. doi: 10.1097/MD.0000000000037325.
Large language models (LLMs) have been deployed in diverse fields, and numerous studies have explored their potential applications in medicine. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard on an Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the bank, 150 were randomly selected, and 27 containing figures were excluded. Questions requiring analysis, creative thinking, evaluation, or synthesis were classified as higher-order questions; those requiring only recall of memorized factual information were classified as lower-order questions. The remaining 123 questions were input into each LLM, and the resulting answers and explanations were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) achieved higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 had the highest correct response rate on higher-order questions (76.5%), while Bard and Bing Chat shared the highest rate on lower-order questions (71.4%). The appropriateness of the explanations was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Korean-language Emergency Medicine Board Examination questions.
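For readers who want to reproduce a comparison of this kind, the following is a minimal Python sketch of the selection-and-scoring pipeline the abstract describes. The question IDs, the has_figure and order predicates, and the grade function are all illustrative assumptions standing in for the study's actual question bank, exclusion criteria, Bloom-style classification, and manual grading; none of them comes from the authors' materials.

```python
import random
from collections import Counter

random.seed(0)

BANK_SIZE = 2353   # size of the question bank (from the abstract)
SAMPLE_SIZE = 150  # number of questions randomly sampled

# Randomly sample 150 question IDs from the bank of 2353.
sampled = random.sample(range(BANK_SIZE), SAMPLE_SIZE)

def has_figure(qid: int) -> bool:
    """Hypothetical stand-in for the figure-exclusion check
    (27 of the 150 sampled questions were excluded in the study)."""
    return qid % 87 == 0  # placeholder rule, illustration only

# Exclude figure-containing questions; 123 remained in the study.
questions = [q for q in sampled if not has_figure(q)]

def order(qid: int) -> str:
    """Hypothetical classifier: higher-order (analysis, evaluation,
    synthesis) vs. lower-order (factual recall)."""
    return "higher" if qid % 2 else "lower"  # placeholder

def grade(model: str, qid: int) -> bool:
    """Stand-in for manually grading one LLM answer against the
    answer key; here a random outcome for illustration."""
    return random.random() < 0.7  # placeholder

models = ["ChatGPT-3.5", "ChatGPT-4", "Bing Chat", "Bard"]

# Tally correct responses per model, overall and by question order.
correct = Counter()
correct_by_order = Counter()
for m in models:
    for q in questions:
        if grade(m, q):
            correct[m] += 1
            correct_by_order[(m, order(q))] += 1

for m in models:
    print(f"{m}: {correct[m] / len(questions):.1%} correct overall")
```

In the actual study, grading and classification were of course done by hand rather than by placeholder functions; the sketch only shows how the sampling, exclusion, and per-model rate computation fit together.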