

Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues.

Authors

Stengel Felix C, Stienen Martin N, Ivanov Marcel, Gandía-González María L, Raffa Giovanni, Ganau Mario, Whitfield Peter, Motov Stefan

Affiliations

Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St. Gallen, St. Gallen, Switzerland.

Royal Hallamshire Hospital, Sheffield, United Kingdom.

Publication

Brain Spine. 2024 Feb 13;4:102765. doi: 10.1016/j.bas.2024.102765. eCollection 2024.

DOI: 10.1016/j.bas.2024.102765
PMID: 38510593
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10951784/
Abstract

INTRODUCTION

Artificial intelligence (AI)-based large language models (LLMs) hold enormous potential in education and training. Recent publications have demonstrated that they can outperform human participants in written medical exams.

RESEARCH QUESTION

We aimed to explore the accuracy of AI in the written part of the European Association of Neurosurgical Societies (EANS) board exam.

MATERIAL AND METHODS

Eighty-six representative single-best-answer (SBA) questions, each included at least ten times in prior EANS board exams, were selected by the current EANS board exam committee. By content, 75 questions were classified as text-based (TB) and 11 as image-based (IB); by structure, 50 were interpretation-weighted, 30 theory-based, and 6 true-or-false. The questions were tested with ChatGPT 3.5, Bing, and Bard. The AI and participant results were statistically analyzed through ANOVA tests with Stata SE 15 (StataCorp, College Station, TX). P-values < 0.05 were considered statistically significant.
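The comparison described above — per-question correctness scored across the LLMs and the human cohort, then tested with one-way ANOVA at α = 0.05 — was run in Stata. A minimal pure-Python sketch of the same kind of test, using made-up toy correctness vectors (not the study's data), might look like:

```python
# Illustrative sketch of a one-way ANOVA F statistic; the vectors below are
# hypothetical 0/1 correctness scores, NOT the 86 EANS SBA question results.
def one_way_anova_F(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)                           # number of groups
    N = sum(len(g) for g in groups)           # total observations
    grand_mean = sum(sum(g) for g in groups) / N
    # Between-group sum of squares (group size times squared mean deviation)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (squared deviations from each group mean)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (N - k))

chatgpt = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = correct, 0 = wrong (toy data)
bing    = [1, 1, 0, 1, 1, 0, 1, 1]
bard    = [1, 1, 1, 0, 1, 1, 1, 0]
humans  = [1, 0, 1, 0, 1, 1, 0, 1]

F = one_way_anova_F([chatgpt, bing, bard, humans])
print(f"F = {F:.3f}")
# The p-value is then read from the F(k-1, N-k) distribution, and the paper's
# decision rule treats p < 0.05 as statistically significant.
```

The sketch stops at the F statistic; a statistics package (Stata, as in the paper, or SciPy's `stats.f_oneway`) would convert it to a p-value from the F distribution.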

RESULTS

The Bard LLM achieved the highest accuracy, with 62% of questions correct overall and 69% when IB questions were excluded, outperforming the human exam participants' 59% (p = 0.67) and 59% (p = 0.42), respectively. All LLMs scored highest on theory-based questions, excluding IB questions (ChatGPT: 79%; Bing: 83%; Bard: 86%), performing significantly better than the human exam participants (60%; p = 0.03). No LLM answered any IB question correctly.

DISCUSSION AND CONCLUSION

AI passed the written EANS board exam based on representative SBA questions and achieved results close to, or even better than, those of the human exam participants. Our results raise several ethical and practical issues, which may impact the current concept of the written EANS board exam.


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/b52746c25552/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/a36814b6d401/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/bb53032f01a7/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/fd28cd1657e9/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/e56133203f8c/gr5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/e0453230db75/gr6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/140aa6cddd25/gr7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/a19dfec2da55/gr8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef56/10951784/904c182a3764/gr9.jpg

Similar Articles

1. Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues. Brain Spine. 2024 Feb 13;4:102765. doi: 10.1016/j.bas.2024.102765. eCollection 2024.
2. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
3. Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions With Image Analysis Insights. Cureus. 2024 Jul 9;16(7):e64204. doi: 10.7759/cureus.64204. eCollection 2024 Jul.
4. Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard. Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2137-2143. doi: 10.1007/s00405-023-08381-3. Epub 2023 Dec 20.
5. Generative pretrained transformer-4, an artificial intelligence text predictive model, has a high capability for passing novel written radiology exam questions. Int J Comput Assist Radiol Surg. 2024 Apr;19(4):645-653. doi: 10.1007/s11548-024-03071-9. Epub 2024 Feb 21.
6. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard. JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
7. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery. 2023 Nov 1;93(5):1090-1098. doi: 10.1227/neu.0000000000002551. Epub 2023 Jun 12.
8. Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists. J Clin Densitom. 2024 Apr-Jun;27(2):101480. doi: 10.1016/j.jocd.2024.101480. Epub 2024 Feb 17.
9. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT. Clin Orthop Relat Res. 2023 Aug 1;481(8):1623-1630. doi: 10.1097/CORR.0000000000002704. Epub 2023 May 23.
10. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023 Nov;179:e160-e165. doi: 10.1016/j.wneu.2023.08.042. Epub 2023 Aug 18.

Cited By

1. Exploring perspectives and boundaries in neurosurgical career pathways for generation Z in German-speaking countries. Brain Spine. 2025 Aug 6;5:104382. doi: 10.1016/j.bas.2025.104382. eCollection 2025.
2. Can we trust academic AI detective? Accuracy and limitations of AI-output detectors. Acta Neurochir (Wien). 2025 Aug 7;167(1):214. doi: 10.1007/s00701-025-06622-4.
3. Assessing the performance of ChatGPT-4o on the Turkish Orthopedics and Traumatology Board Examination. Jt Dis Relat Surg. 2025 Apr 5;36(2):304-310. doi: 10.52312/jdrs.2025.1958.
4. Performance of 5 Prominent Large Language Models in Surgical Knowledge Evaluation: A Comparative Analysis. Mayo Clin Proc Digit Health. 2024 Jun 5;2(3):348-350. doi: 10.1016/j.mcpdig.2024.05.022. eCollection 2024 Sep.
5. Employing large language models safely and effectively as a practicing neurosurgeon. Acta Neurochir (Wien). 2025 Apr 9;167(1):101. doi: 10.1007/s00701-025-06515-6.
6. Reliability, Accuracy, and Comprehensibility of AI-Based Responses to Common Patient Questions Regarding Spinal Cord Stimulation. J Clin Med. 2025 Feb 21;14(5):1453. doi: 10.3390/jcm14051453.
7. ChatGPT's Performance in Spinal Metastasis Cases - Can We Discuss Our Complex Cases with ChatGPT? J Clin Med. 2024 Dec 23;13(24):7864. doi: 10.3390/jcm13247864.
8. Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir (Wien). 2024 Nov 23;166(1):475. doi: 10.1007/s00701-024-06372-9.
9. Assessing ChatGPT's summarization of Ga PSMA PET/CT reports for patients. Abdom Radiol (NY). 2025 Mar;50(3):1467-1474. doi: 10.1007/s00261-024-04619-8. Epub 2024 Sep 30.
10. Assessment Study of ChatGPT-3.5's Performance on the Final Polish Medical Examination: Accuracy in Answering 980 Questions. Healthcare (Basel). 2024 Aug 16;12(16):1637. doi: 10.3390/healthcare12161637.

References

1. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ. 2023 Oct 17;23(1):772. doi: 10.1186/s12909-023-04752-w.
2. Large Language Model-Based Neurosurgical Evaluation Matrix: A Novel Scoring Criteria to Assess the Efficacy of ChatGPT as an Educational Tool for Neurosurgery Board Preparation. World Neurosurg. 2023 Dec;180:e765-e773. doi: 10.1016/j.wneu.2023.10.043. Epub 2023 Oct 14.
3. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023 Nov;179:e160-e165. doi: 10.1016/j.wneu.2023.08.042. Epub 2023 Aug 18.
4. ChatGPT - A double-edged sword for healthcare education? Implications for assessments of dental students. Eur J Dent Educ. 2024 Feb;28(1):206-211. doi: 10.1111/eje.12937. Epub 2023 Aug 7.
5. Assessing ChatGPT's ability to pass the FRCS orthopaedic part A exam: A critical analysis. Surgeon. 2023 Oct;21(5):263-266. doi: 10.1016/j.surge.2023.07.001. Epub 2023 Jul 28.
6. European training requirements in neurological surgery: A new outcomes-based 3 stage UEMS curriculum. Brain Spine. 2023 Apr 25;3:101744. doi: 10.1016/j.bas.2023.101744. eCollection 2023.
7. Large language models for oncological applications. J Cancer Res Clin Oncol. 2023 Sep;149(11):9505-9508. doi: 10.1007/s00432-023-04824-w. Epub 2023 May 9.
8. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc. 2023 Jun 20;30(7):1237-1245. doi: 10.1093/jamia/ocad072.
9. Harnessing the power of ChatGPT in medical education. Med Teach. 2023 Sep;45(9):1063. doi: 10.1080/0142159X.2023.2198094. Epub 2023 Apr 10.
10. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.