Suppr 超能文献



Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies.

Author information

Çamlar Mahmut, Sevgi Umut Tan, Erol Gökberk, Karakaş Furkan, Doğruel Yücel, Güngör Abuzer

Affiliations

Department of Neurosurgery, Izmir City Hospital, University of Health Sciences, Şevket İnce Neighborhood, 2148/11 Street, No:1/11, 35540, Bayraklı, İzmir, Turkey.

Department of Neurosurgery, Adıyaman Training and Research Hospital, Adıyaman, Turkey.

Publication information

Acta Neurochir (Wien). 2025 Sep 9;167(1):241. doi: 10.1007/s00701-025-06628-y.

DOI: 10.1007/s00701-025-06628-y
PMID: 40924209
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12420735/
Abstract

BACKGROUND

Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.

METHODS

A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen's kappa.
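Both statistics named here are straightforward to compute. A minimal Python sketch of a pooled two-proportion z statistic and Cohen's kappa, under the assumption of two models' answer sheets as label sequences (the function names and example counts are illustrative, not the study's data):

```python
import math
from collections import Counter

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for comparing correctness rates."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p1 - p2) / se

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical example: model A answers 480/599 correctly, model B 455/599.
z = two_proportion_z(480, 599, 455, 599)
kappa = cohens_kappa(["A", "A", "B", "C"], ["A", "A", "B", "B"])
```

Equal proportions give z = 0, identical sequences give kappa = 1, and agreement at chance level gives kappa = 0.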

RESULTS

ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer-Recall of Knowledge (SBA-R), Single Best Answer-Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items.

CONCLUSION

Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38f3/12420735/9e96e1aa1b06/701_2025_6628_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38f3/12420735/b826c706f868/701_2025_6628_Fig2_HTML.jpg

Similar articles

1
Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies.
Acta Neurochir (Wien). 2025 Sep 9;167(1):241. doi: 10.1007/s00701-025-06628-y.
2
Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
3
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
4
Prescription of Controlled Substances: Benefits and Risks
5
Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education.
Int J Emerg Med. 2025 Aug 7;18(1):146. doi: 10.1186/s12245-025-00949-6.
6
Stench of Errors or the Shine of Potential: The Challenge of (Ir)Responsible Use of ChatGPT in Speech-Language Pathology.
Int J Lang Commun Disord. 2025 Jul-Aug;60(4):e70088. doi: 10.1111/1460-6984.70088.
7
The performance of ChatGPT versus neurosurgery residents in neurosurgical board examination-like questions: a systematic review and meta-analysis.
Neurosurg Rev. 2024 Dec 7;47(1):892. doi: 10.1007/s10143-024-03144-y.
8
Large Language Models and Empathy: Systematic Review.
J Med Internet Res. 2024 Dec 11;26:e52597. doi: 10.2196/52597.
9
Comparison of ChatGPT and Internet Research for Clinical Research and Decision-Making in Occupational Medicine: Randomized Controlled Trial.
JMIR Form Res. 2025 May 20;9:e63857. doi: 10.2196/63857.
10
Evaluating Large Language Models on American Board of Anesthesiology-style Anesthesiology Questions: Accuracy, Domain Consistency, and Clinical Implications.
J Cardiothorac Vasc Anesth. 2025 Sep;39(9):2511-2515. doi: 10.1053/j.jvca.2025.05.033. Epub 2025 May 21.
