Patel Serena, Patel Rohit
General Surgery, Imperial College NHS Trust, Ilford, GBR.
Oral and Maxillofacial Surgery, King's College Hospital, London, GBR.
Cureus. 2024 Dec 18;16(12):e75961. doi: 10.7759/cureus.75961. eCollection 2024 Dec.
Background: It is recognised that large language models (LLMs) may aid medical education by supporting the understanding of the explanations behind answers to multiple-choice questions. This study aimed to evaluate the efficacy of the LLM chatbots ChatGPT and Bard in answering the Intermediate Life Support pre-course multiple-choice question (MCQ) test developed by the Resuscitation Council UK, which focuses on managing deteriorating patients and on identifying the causes of, and treating, cardiac arrest. We assessed the accuracy of the responses and the quality of the explanations to evaluate the utility of the chatbots.

Methods: The AI chatbots ChatGPT-3.5 and Bard were assessed on their ability to choose the correct answer and provide clear, comprehensive explanations for MCQs developed by the Resuscitation Council UK for its Intermediate Life Support Course. Ten MCQs were tested, giving a maximum score of 40, with one point awarded for each accurate response to each sub-statement (a-d). In a separate scoring scheme, a question scored one point only if all sub-statements a-d were answered correctly, giving a maximum score of 10 for the test. The explanations provided by the AI chatbots were rated by three qualified physicians on a 0-3 scale for each overall question, and median rater scores were calculated and compared. Fleiss' multi-rater kappa (κ) was used to determine the agreement in scores among the three raters.

Results: When each overall question was scored to give a total out of 10, Bard outperformed ChatGPT, although the difference was not statistically significant (p=0.37). Likewise, there was no statistically significant difference between ChatGPT and Bard when each sub-question was scored separately to give a total out of 40 (p=0.26). The quality of the explanations was similar for both LLMs. Importantly, even for questions they answered incorrectly, both AI chatbots provided some useful, correct information in their explanations. Fleiss' multi-rater kappa was 0.899 (p<0.001) for ChatGPT and 0.801 (p<0.001) for Bard.

Conclusions: Bard and ChatGPT performed similarly in answering the MCQs, achieving comparable scores. Notably, despite having access to data from across the web, neither LLM answered all questions accurately. This suggests that further learning is still required of AI models before they can be relied upon in medical education.
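As a purely illustrative aside (not part of the study or its analysis code), the sketch below shows how Fleiss' multi-rater kappa, the agreement statistic reported above, can be computed for three raters scoring explanations on the 0-3 scale. The function name fleiss_kappa and the rating values are assumptions introduced for illustration only; they are not the study's data.

# Minimal sketch, assuming three raters score each explanation on a 0-3 scale.
# The ratings below are hypothetical, not data from the study.
import numpy as np

def fleiss_kappa(ratings: np.ndarray, n_categories: int = 4) -> float:
    """ratings: array of shape (n_items, n_raters) with category labels 0..n_categories-1."""
    n_items, n_raters = ratings.shape
    # Count how many raters assigned each item to each category.
    counts = np.zeros((n_items, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    # Per-item observed agreement P_i and its mean P_bar.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical 0-3 explanation-quality scores from three raters for ten questions.
scores = np.array([
    [3, 3, 3], [2, 2, 3], [3, 3, 3], [1, 1, 2], [3, 3, 3],
    [2, 2, 2], [3, 3, 3], [0, 0, 1], [3, 3, 3], [2, 2, 2],
])
print(f"Fleiss' kappa = {fleiss_kappa(scores):.3f}")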