Casals-Farre Octavi, Baskaran Ravanth, Singh Aditya, Kaur Harmeena, Ul Hoque Tazim, de Almeida Andreia, Coffey Marcus, Hassoulas Athanasios
Centre for Medical Education (C4ME), School of Medicine, Cardiff University, Heath Park Campus, Cardiff, CF14 4YS, United Kingdom.
OSCEazy Research Collaborative, Heath Park Campus, Cardiff, CF14 4YS, United Kingdom.
Sci Rep. 2025 Apr 15;15(1):13031. doi: 10.1038/s41598-025-97327-2.
Advances in the various applications of artificial intelligence will have important implications for medical training and practice. The advances in ChatGPT-4, alongside the introduction of the Medical Licensing Assessment (MLA), provide an opportunity to compare GPT-4's medical competence against the expected level of a United Kingdom junior doctor and to discuss its potential in clinical practice. Using 191 freely available MLA-style questions, we assessed GPT-4's accuracy with and without the multiple-choice options. We compared single-step and multi-step questions, which targeted different points in the clinical process, from diagnosis to management. A chi-squared test was used to assess statistical significance. GPT-4 scored 86.3% and 89.6% in papers one and two respectively. Without the multiple-choice options, GPT-4's performance was 61.5% and 74.7% in papers one and two respectively. There was no significant difference between single-step and multi-step questions, but GPT-4 answered 'management' questions significantly worse than 'diagnosis' questions when no multiple-choice options were offered (p = 0.015). GPT-4's accuracy across categories and question structures suggests that large language models can competently process clinical scenarios but remain incapable of understanding them. Large language models incorporated into practice alongside a trained practitioner may balance risk and benefit while the necessary robust testing of these evolving tools is conducted.