Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations.

Authors

Ali Rohaid, Tang Oliver Y, Connolly Ian D, Zadnik Sullivan Patricia L, Shin John H, Fridley Jared S, Asaad Wael F, Cielo Deus, Oyelese Adetokunbo A, Doberstein Curtis E, Gokaslan Ziya L, Telfeian Albert E

Affiliations

Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA.

Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA.

Publication Information

Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15.

Abstract

BACKGROUND AND OBJECTIVES

Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized examinations and the factors affecting accuracy remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS

The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ2, Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.
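The abstract does not include analysis code; as a minimal sketch of the kinds of tests it names (chi-square, Fisher exact, and univariable logistic regression), the snippet below runs them on hypothetical per-question data in Python with scipy and statsmodels. The data frame, counts, and variable names are illustrative assumptions, not the study's dataset.

    # Minimal sketch of the named statistical tests on hypothetical data;
    # nothing here reproduces the study's actual dataset or code.
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency, fisher_exact
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Hypothetical per-question results: 1 = answered correctly, 0 = incorrect.
    df = pd.DataFrame({
        "gpt4_correct":    rng.binomial(1, 0.83, 500),
        "chatgpt_correct": rng.binomial(1, 0.73, 500),
        "word_count":      rng.integers(20, 200, 500),
    })

    # 2x2 table: model (GPT-4 vs ChatGPT) x outcome (correct vs incorrect).
    table = np.array([
        [df["gpt4_correct"].sum(),    500 - df["gpt4_correct"].sum()],
        [df["chatgpt_correct"].sum(), 500 - df["chatgpt_correct"].sum()],
    ])
    chi2, p_chi2, _, _ = chi2_contingency(table)
    odds_ratio, p_fisher = fisher_exact(table)

    # Univariable logistic regression of correctness on question length for
    # ChatGPT; dividing word count by 10 makes exp(coef) an odds ratio
    # "per +10 words".
    X = sm.add_constant(df["word_count"] / 10)
    fit = sm.Logit(df["chatgpt_correct"], X).fit(disp=0)
    or_per_10_words = np.exp(fit.params["word_count"])

    print(f"chi-square p={p_chi2:.3f}, Fisher p={p_fisher:.3f}, "
          f"OR per +10 words={or_per_10_words:.2f}")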

RESULTS

ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores for ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions ChatGPT answered incorrectly. Among the 12 question categories, GPT-4 significantly outperformed users in each, performed comparably with ChatGPT in 3 (functional, other general, and spine), and outperformed both users and ChatGPT on tumor questions. Increased word count (odds ratio = 0.89 of answering correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.
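As a short arithmetic sketch of how the headline figures can be sanity-checked: the 500-question denominator and the percentages come from the abstract, while the Wilson interval method and the rounding to whole correct-answer counts are assumptions, since the paper's exact CI procedure is not stated here.

    # Rough check of the reported score and CI under an assumed Wilson interval.
    from statsmodels.stats.proportion import proportion_confint

    n_questions = 500
    gpt4_correct = round(0.834 * n_questions)   # ~417 questions (assumed rounding)
    lo, hi = proportion_confint(gpt4_correct, n_questions,
                                alpha=0.05, method="wilson")
    print(f"GPT-4: {gpt4_correct / n_questions:.1%} (95% CI {lo:.1%}-{hi:.1%})")
    # Yields roughly 83.4% (79.9%-86.4%), in line with the reported 79.8%-86.5%.

    # Interpreting the word-count effect for ChatGPT: an odds ratio of 0.89 per
    # +10 words means a question 50 words longer multiplies the odds of a
    # correct answer by 0.89 ** 5, i.e. about 0.56.
    print(0.89 ** 5)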

CONCLUSION

LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
