Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations.

Authors

Ali Rohaid, Tang Oliver Y, Connolly Ian D, Zadnik Sullivan Patricia L, Shin John H, Fridley Jared S, Asaad Wael F, Cielo Deus, Oyelese Adetokunbo A, Doberstein Curtis E, Gokaslan Ziya L, Telfeian Albert E

Affiliations

Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA.

Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA.

Publication Information

Neurosurgery. 2023 Dec 1;93(6):1353-1365. doi: 10.1227/neu.0000000000002632. Epub 2023 Aug 15.

Abstract

BACKGROUND AND OBJECTIVES

Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized examinations and the factors affecting accuracy remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS

The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ2, Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.
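The abstract does not include analysis code; as a minimal sketch of the kinds of tests it names (chi-square, Fisher exact, and univariable logistic regression), the snippet below runs them on hypothetical per-question data in Python with scipy and statsmodels. The data frame, counts, and variable names are illustrative assumptions, not the study's dataset.

    # Minimal sketch of the named statistical tests on hypothetical data;
    # nothing here reproduces the study's actual dataset or code.
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency, fisher_exact
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Hypothetical per-question results: 1 = answered correctly, 0 = incorrect.
    df = pd.DataFrame({
        "gpt4_correct":    rng.binomial(1, 0.83, 500),
        "chatgpt_correct": rng.binomial(1, 0.73, 500),
        "word_count":      rng.integers(20, 200, 500),
    })

    # 2x2 table: model (GPT-4 vs ChatGPT) x outcome (correct vs incorrect).
    table = np.array([
        [df["gpt4_correct"].sum(),    500 - df["gpt4_correct"].sum()],
        [df["chatgpt_correct"].sum(), 500 - df["chatgpt_correct"].sum()],
    ])
    chi2, p_chi2, _, _ = chi2_contingency(table)
    odds_ratio, p_fisher = fisher_exact(table)

    # Univariable logistic regression of correctness on question length for
    # ChatGPT; dividing word count by 10 makes exp(coef) an odds ratio
    # "per +10 words".
    X = sm.add_constant(df["word_count"] / 10)
    fit = sm.Logit(df["chatgpt_correct"], X).fit(disp=0)
    or_per_10_words = np.exp(fit.params["word_count"])

    print(f"chi-square p={p_chi2:.3f}, Fisher p={p_fisher:.3f}, "
          f"OR per +10 words={or_per_10_words:.2f}")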

RESULTS

ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores for ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions ChatGPT answered incorrectly. Among the 12 question categories, GPT-4 significantly outperformed users in each, performed comparably with ChatGPT in 3 (functional, other general, and spine), and outperformed both users and ChatGPT on tumor questions. Increased word count (odds ratio = 0.89 of answering correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.
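As a short arithmetic sketch of how the headline figures can be sanity-checked: the 500-question denominator and the percentages come from the abstract, while the Wilson interval method and the rounding to whole correct-answer counts are assumptions, since the paper's exact CI procedure is not stated here.

    # Rough check of the reported score and CI under an assumed Wilson interval.
    from statsmodels.stats.proportion import proportion_confint

    n_questions = 500
    gpt4_correct = round(0.834 * n_questions)   # ~417 questions (assumed rounding)
    lo, hi = proportion_confint(gpt4_correct, n_questions,
                                alpha=0.05, method="wilson")
    print(f"GPT-4: {gpt4_correct / n_questions:.1%} (95% CI {lo:.1%}-{hi:.1%})")
    # Yields roughly 83.4% (79.9%-86.4%), in line with the reported 79.8%-86.5%.

    # Interpreting the word-count effect for ChatGPT: an odds ratio of 0.89 per
    # +10 words means a question 50 words longer multiplies the odds of a
    # correct answer by 0.89 ** 5, i.e. about 0.56.
    print(0.89 ** 5)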

CONCLUSION

LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
