Casals-Farre Octavi, Baskaran Ravanth, Singh Aditya, Kaur Harmeena, Ul Hoque Tazim, de Almeida Andreia, Coffey Marcus, Hassoulas Athanasios
Centre for Medical Education (C4ME), School of Medicine, Cardiff University, Heath Park Campus, Cardiff, CF14 4YS, United Kingdom.
OSCEazy Research Collaborative, Heath Park Campus, Cardiff, CF14 4YS, United Kingdom.
Sci Rep. 2025 Apr 15;15(1):13031. doi: 10.1038/s41598-025-97327-2.
Advances in the various applications of artificial intelligence will have important implications for medical training and practice. The advances in ChatGPT-4, alongside the introduction of the Medical Licensing Assessment (MLA), provide an opportunity to compare GPT-4's medical competence against the expected level of a United Kingdom junior doctor and to discuss its potential in clinical practice. Using 191 freely available MLA-style questions, we assessed GPT-4's accuracy with and without the multiple-choice options. We compared single-step and multi-step questions, which targeted different points in the clinical process, from diagnosis to management. A chi-squared test was used to assess statistical significance. GPT-4 scored 86.3% and 89.6% in papers one and two respectively. Without the multiple-choice options, GPT-4's performance was 61.5% and 74.7% in papers one and two respectively. There was no significant difference between single-step and multi-step questions, but GPT-4 answered 'management' questions significantly worse than 'diagnosis' questions when no multiple-choice options were offered (p = 0.015). GPT-4's accuracy across categories and question structures suggests that large language models can competently process clinical scenarios but remain incapable of understanding them. Large language models incorporated into practice alongside a trained practitioner may balance risk and benefit while the necessary robust testing of these evolving tools is conducted.