Ros-Arlanzón Pablo, Perez-Sempere Angel
Department of Neurology, Dr. Balmis General University Hospital, C/ Pintor Baeza, Nº 11, Alicante, 03010, Spain. Tel: +34 965933000.
Department of Neuroscience, Instituto de Investigación Sanitaria y Biomédica de Alicante, Alicante, Spain.
JMIR Med Educ. 2024 Nov 14;10:e56762. doi: 10.2196/56762.
With the rapid advancement of artificial intelligence (AI) across many fields, evaluating its application in specialized medical contexts has become crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine.
This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI's capabilities and limitations in medical knowledge.
We conducted a comparative analysis of the results of the 2022 neurology specialist examination obtained by 120 neurologists and of the responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions focused on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy, and we performed a statistical analysis of performance, including the κ coefficient for response consistency.
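For context, a minimal sketch of how such a consistency coefficient is typically computed, assuming the authors used the standard Cohen's κ (the abstract does not specify the variant):

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed proportion of agreement between two sets of categorical responses (eg, two runs of the same model over the 80 questions) and \(p_e\) is the proportion of agreement expected by chance. A κ of 1 indicates perfect agreement, and 0 indicates agreement no better than chance.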
Human participants achieved a median score of 5.91 (IQR 4.93-6.76), and 32 neurologists failed to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of the questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of answers correct (score 7.57), surpassing several human specialists. No significant differences were observed between performance on lower-order and higher-order questions. In addition, ChatGPT-4 demonstrated greater interrater reliability, with a κ coefficient of 0.73 compared with 0.69 for ChatGPT-3.5.
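The reported scores are consistent with the negative-marking rule commonly used in Spanish specialist examinations, in which each incorrect answer subtracts one third of the value of a correct one; this rule is our assumption, as the abstract does not state the scoring formula. Under it, with \(C\) correct and \(W\) incorrect answers out of \(N = 80\) questions,

\[
\text{score} = \frac{C - W/3}{N} \times 10.
\]

For ChatGPT-3.5 (\(C \approx 0.545 \times 80 \approx 43.6\), \(W \approx 36.4\)) this gives \(\frac{43.6 - 36.4/3}{80} \times 10 \approx 3.93\), matching the reported 3.94; for ChatGPT-4 (\(C \approx 65.4\), \(W \approx 14.6\)) it gives approximately 7.57, matching the reported score.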
This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, which exceeded the median score of the human participants in a rigorous neurology examination, represents a significant milestone in AI development and suggests its potential as an effective tool in specialized medical education and assessment.