Çamlar Mahmut, Sevgi Umut Tan, Erol Gökberk, Karakaş Furkan, Doğruel Yücel, Güngör Abuzer
Department of Neurosurgery, Izmir City Hospital, University of Health Sciences, Şevket İnce Neighborhood, 2148/11 Street, No:1/11, 35540, Bayraklı, İzmir, Turkey.
Department of Neurosurgery, Adıyaman Training and Research Hospital, Adıyaman, Turkey.
Acta Neurochir (Wien). 2025 Sep 9;167(1):241. doi: 10.1007/s00701-025-06628-y.
Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students and residents preparing for examinations. These studies, particularly those based on multiple-choice questions, report that the knowledge level and response consistency of LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs on Turkish and English neurosurgery multiple-choice questions and to assess their accuracy and consistency in a specialized medical context.
A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen's kappa.
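For readers unfamiliar with the two statistics named above, the following is a minimal, purely illustrative Python sketch of how a two-proportion z-test and Cohen's kappa are typically computed; the counts and per-item answers are hypothetical placeholders, and this is not the authors' analysis code.

```python
# Illustrative sketch only: two-proportion z-test and Cohen's kappa,
# using hypothetical data and standard statsmodels / scikit-learn calls.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from sklearn.metrics import cohen_kappa_score

# Hypothetical correctness counts for two models on the same 599-item set.
correct = np.array([450, 400])   # assumed values, for illustration only
n_items = np.array([599, 599])

z_stat, p_value = proportions_ztest(count=correct, nobs=n_items)
print(f"two-proportion z = {z_stat:.2f}, p = {p_value:.4f}")

# Hypothetical per-item answer choices (option index 0-4) from two models,
# used to gauge inter-model agreement beyond chance.
rng = np.random.default_rng(0)
model_a = rng.integers(0, 5, size=599)
model_b = rng.integers(0, 5, size=599)
print(f"Cohen's kappa = {cohen_kappa_score(model_a, model_b):.2f}")
```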
ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy on Single Best Answer-Recall of Knowledge (SBA-R), Single Best Answer-Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance decreased notably on image-based questions, with some models leaving many items unanswered.
Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level in SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibited notable limitations on image-based questions, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.