Misr University for Science and Technology, 6th of October, Egypt.
Medical Research Platform (MRP), Giza, Egypt.
Sci Rep. 2024 Aug 14;14(1):18859. doi: 10.1038/s41598-024-68996-2.
Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students prepare for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. The accuracy of the output was graded, and statistical tests were used to analyze differences among the LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%); Perplexity scored lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating that refinements are needed before they can serve as a primary resource in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.
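A minimal sketch of the kind of analysis the abstract describes, not the authors' actual code: the paper only states that statistics were used, so the chi-squared omnibus test, the Bonferroni-corrected pairwise comparisons, and the correct-answer counts derived by rounding N × accuracy are all illustrative assumptions.

```python
# Hypothetical reconstruction of an accuracy comparison across LLMs,
# assuming the reported percentages and N = 423 questions per model.
from itertools import combinations
from scipy.stats import chi2_contingency

N = 423  # board-style questions answered by every model

# Reported overall accuracies; correct-answer counts are approximated
# by rounding, since the paper reports only percentages.
accuracy = {
    "ChatGPT-4": 0.782,
    "Bing": 0.672,
    "Claude": 0.644,
    "Claude Instant": 0.629,
    "Perplexity": 0.561,
}
counts = {model: round(N * a) for model, a in accuracy.items()}

# Omnibus test: does accuracy differ across models at all?
table = [[c, N - c] for c in counts.values()]  # rows: [correct, incorrect]
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2={chi2:.1f}, dof={dof}, p={p:.2e}")

# Pairwise comparisons with a Bonferroni correction (an assumed choice).
pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    sub = [[counts[a], N - counts[a]], [counts[b], N - counts[b]]]
    _, p_pair, _, _ = chi2_contingency(sub)
    flag = "*" if p_pair < alpha else ""
    print(f"{a} vs {b}: p={p_pair:.3g} {flag}")
```

With these rounded counts the omnibus test comes out far below 0.001, consistent with the significant overall difference the abstract reports.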