Panagiotis Giannos
Department of Life Sciences, Imperial College London, London, UK.
Society of Meta-Research and Biomedical Innovation, London, UK.
BMJ Neurol Open. 2023 Jun 15;5(1):e000451. doi: 10.1136/bmjno-2023-000451. eCollection 2023.
Large language models such as ChatGPT have demonstrated potential as innovative tools for medical education and practice, with studies showing that they can perform at or near the passing threshold on general medical examinations and standardised admission tests. However, no studies have assessed their performance in the UK medical education context, particularly at specialty level in neurology and neuroscience.
We evaluated the performance of ChatGPT in higher specialty training for neurology and neuroscience, using 69 questions from the Pool-Specialty Certificate Examination (SCE) Neurology Web Questions bank. The dataset focused primarily on neurology (80% of questions). The questions spanned subtopics such as symptoms and signs, diagnosis, interpretation and management, with some questions addressing specific patient populations. The performance of the ChatGPT 3.5 Legacy, ChatGPT 3.5 Default and ChatGPT-4 models was evaluated and compared.
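The paper describes its methods in prose rather than code; purely as an illustration of the scoring protocol, the following minimal Python sketch shows how "best of five" answers could be checked against an answer key and broken down by subtopic. The Question structure, the ask_model callable and the subtopic labels are assumptions for this sketch, not the authors' pipeline (the model names suggest the models were queried through the ChatGPT web interface).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    stem: str                # clinical vignette and question text
    options: dict[str, str]  # answer options, e.g. {"A": "...", ..., "E": "..."}
    key: str                 # correct option letter
    subtopic: str            # e.g. "diagnosis", "management"

def score_model(questions: list[Question],
                ask_model: Callable[[Question], str]) -> dict[str, float]:
    """Return overall and per-subtopic accuracy for one model."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for q in questions:
        total[q.subtopic] = total.get(q.subtopic, 0) + 1
        if ask_model(q).strip().upper() == q.key:  # exact-match scoring
            correct[q.subtopic] = correct.get(q.subtopic, 0) + 1
    scores = {s: correct.get(s, 0) / n for s, n in total.items()}
    scores["overall"] = sum(correct.values()) / len(questions)
    return scores
```

Running score_model once per model variant with the same 69 questions would yield the per-model comparison reported below.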
ChatGPT 3.5 Legacy and ChatGPT 3.5 Default displayed overall accuracies of 42% and 57%, respectively, falling short of the passing threshold of 58% for the 2022 SCE neurology examination. ChatGPT-4, on the other hand, achieved the highest accuracy of 64%, surpassing the passing threshold and outperforming its predecessors across disciplines and subtopics.
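Since the bank contains 69 equally weighted questions, the reported accuracies correspond to roughly 29, 39 and 44 correct answers; these counts are back-calculated from the rounded percentages, not taken from the paper. A minimal sketch of the pass/fail arithmetic against the 58% threshold:

```python
TOTAL = 69        # questions in the evaluated bank
PASS_MARK = 0.58  # passing threshold of the 2022 SCE neurology examination

# Correct-answer counts inferred from the reported rounded accuracies
results = {"ChatGPT 3.5 Legacy": 29, "ChatGPT 3.5 Default": 39, "ChatGPT-4": 44}

for model, n_correct in results.items():
    acc = n_correct / TOTAL
    verdict = "pass" if acc >= PASS_MARK else "fail"
    print(f"{model}: {acc:.0%} -> {verdict}")
# ChatGPT 3.5 Legacy: 42% -> fail
# ChatGPT 3.5 Default: 57% -> fail
# ChatGPT-4: 64% -> pass
```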
The advancements in ChatGPT-4's performance over its predecessors demonstrate the potential of artificial intelligence (AI) models in specialised medical education and practice. However, our findings also highlight the need for ongoing development, and for collaboration between AI developers and medical experts, to ensure the models' relevance and reliability in the rapidly evolving field of medicine.