Stengel Felix C, Stienen Martin N, Ivanov Marcel, Gandía-González María L, Raffa Giovanni, Ganau Mario, Whitfield Peter, Motov Stefan
Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St. Gallen, St. Gallen, Switzerland.
Royal Hallamshire Hospital, Sheffield, United Kingdom.
Brain Spine. 2024 Feb 13;4:102765. doi: 10.1016/j.bas.2024.102765. eCollection 2024.
Artificial intelligence (AI)-based large language models (LLMs) hold enormous potential for education and training. Recent publications have demonstrated that they are able to outperform participants in written medical exams.
We aimed to explore the accuracy of AI in the written part of the European Association of Neurosurgical Societies (EANS) board exam.
Eighty-six representative single best answer (SBA) questions, each included at least ten times in prior EANS board exams, were selected by the current EANS board exam committee. The questions were classified by content as 75 text-based (TB) and 11 image-based (IB), and by structure as 50 interpretation-weighted, 30 theory-based and 6 true-or-false. The questions were tested with ChatGPT 3.5, Bing and Bard. The AI and participant results were statistically analyzed with ANOVA tests in Stata SE 15 (StataCorp, College Station, TX). P-values of <0.05 were considered statistically significant.
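As an illustration of the group comparison described in the methods, the following is a minimal sketch of a one-way ANOVA in Python (scipy.stats), not the authors' Stata SE 15 analysis; the per-question 0/1 score vectors are hypothetical placeholders, and only the 75 TB questions are modeled, since no LLM answered an IB question correctly.

```python
# Hedged sketch: one-way ANOVA over per-question accuracy for three LLMs and
# human participants. Score vectors are simulated placeholders, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 0/1 scores for the 75 text-based (TB) questions.
chatgpt = rng.binomial(1, 0.60, 75)
bing    = rng.binomial(1, 0.61, 75)
bard    = rng.binomial(1, 0.69, 75)
humans  = rng.binomial(1, 0.59, 75)

# One-way ANOVA across the four groups; p < 0.05 is treated as significant,
# mirroring the threshold reported in the methods.
f_stat, p_value = stats.f_oneway(chatgpt, bing, bard, humans)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```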
The Bard LLM achieved the highest accuracy, with 62% of questions answered correctly overall and 69% when IB questions were excluded, outperforming the human exam participants (59%, p = 0.67 and 59%, p = 0.42, respectively). All LLMs scored highest on theory-based questions, excluding IB questions (ChatGPT: 79%; Bing: 83%; Bard: 86%), and performed significantly better than the human exam participants (60%; p = 0.03). No LLM answered any IB question correctly.
AI passed the written EANS board exam based on representative SBA questions and achieved results close to, or even better than, those of the human exam participants. Our results carry several ethical and practical implications, which may affect the current concept of the written EANS board exam.