

Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment.

Author Information

Lai U Hin, Wu Keng Sam, Hsu Ting-Yu, Kan Jessie Kai Ching

Affiliations

Sandwell and West Birmingham NHS Trust, West Bromwich, United Kingdom.

Aston Medical School, Birmingham, United Kingdom.

Publication Information

Front Med (Lausanne). 2023 Sep 19;10:1240915. doi: 10.3389/fmed.2023.1240915. eCollection 2023.


DOI: 10.3389/fmed.2023.1240915
PMID: 37795422
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10547055/
Abstract

INTRODUCTION: Recent developments in artificial intelligence large language models (LLMs), such as ChatGPT, have enabled the understanding and generation of human-like text. Studies have found that LLMs perform well in various examinations, including law, business, and medicine. This study aims to evaluate the performance of ChatGPT in the United Kingdom Medical Licensing Assessment (UKMLA).

METHODS: Two publicly available UKMLA papers consisting of 200 single-best-answer (SBA) questions were screened. Nine SBAs were omitted as they contained images that were not suitable for input. Each question was assigned a specialty based on the UKMLA content map published by the General Medical Council. A total of 191 SBAs were input into ChatGPT-4 across three attempts over the course of 3 weeks (once per week).

RESULTS: ChatGPT scored 74.9% (143/191), 78.0% (149/191) and 75.9% (145/191) on the three attempts, respectively. The average across all three attempts was 76.3% (437/573), with a 95% confidence interval of 74.46% to 78.08%. ChatGPT answered 129 SBAs correctly and 32 SBAs incorrectly on all three attempts. Across the three attempts, ChatGPT performed well in mental health (8/9 SBAs), cancer (11/14 SBAs) and cardiovascular (10/13 SBAs) questions. It did not perform well in clinical haematology (3/7 SBAs), endocrine and metabolic (2/5 SBAs), and gastrointestinal including liver (3/10 SBAs) questions. Regarding response consistency, ChatGPT consistently provided correct answers for 67.5% (129/191) of SBAs, consistently provided incorrect answers for 12.6% (24/191), and gave inconsistent responses for 19.9% (38/191).

DISCUSSION AND CONCLUSION: This study suggests that ChatGPT performs well on the UKMLA. Performance may correlate with specialty. LLMs' ability to answer SBAs correctly suggests that they could be utilised as supplementary learning tools in medical education with appropriate medical educator supervision.
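
The summary statistics above can be reproduced from the per-attempt scores alone. Below is a minimal sketch, assuming the 95% confidence interval was computed as a normal approximation (mean ± 1.96 × standard error) over the three attempt percentages; that assumption is ours, not stated in the abstract, but it reproduces the reported bounds:

```python
from statistics import mean, stdev

# Correct answers on each of the three attempts, out of 191 SBAs (from the abstract).
attempt_scores = [143, 149, 145]
n_questions = 191

percents = [100 * s / n_questions for s in attempt_scores]  # 74.87, 78.01, 75.92

m = mean(percents)                            # 76.27 -> reported as 76.3% (437/573)
se = stdev(percents) / len(percents) ** 0.5   # standard error of the mean
low, high = m - 1.96 * se, m + 1.96 * se      # normal-approximation 95% CI

print(f"mean = {m:.2f}%")                     # 76.27%
print(f"95% CI = ({low:.2f}%, {high:.2f}%)")  # (74.46%, 78.08%), as reported

# Consistency tallies over the three attempts, as reported in the abstract:
# 129 consistently correct + 24 consistently incorrect + 38 inconsistent = 191.
assert 129 + 24 + 38 == n_questions
```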


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d986/10547055/d6dafe450394/fmed-10-1240915-g001.jpg

Similar Articles

[1]
Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment.

Front Med (Lausanne). 2023-9-19

[2]
Performance of Generative Artificial Intelligence in Dental Licensing Examinations.

Int Dent J. 2024-6

[3]
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.

J Med Internet Res. 2024-7-25

[4]
Could ChatGPT Pass the UK Radiology Fellowship Examinations?

Acad Radiol. 2024-5

[5]
Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.

Int J Nurs Stud. 2024-5

[6]
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

JMIR Med Educ. 2023-2-8

[7]
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.

JMIR Med Educ. 2023-9-28

[8]
ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination.

Med Teach. 2024-3

[9]
A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: Performance Study.

JMIR Med Educ. 2024-1-16

[10]
The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland.

Clin Kidney J. 2024-6-22

Cited By

[1]
Clinical applications of large language models in medicine and surgery: A scoping review.

J Int Med Res. 2025-7

[2]
Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study.

JMIR AI. 2025-3-12

[3]
Chatbots' Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis.

JMIR Med Educ. 2025-5-30

[4]
Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions.

Med Sci Educ. 2025-2-4

[5]
Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.

JMIR Cancer. 2025-4-16

[6]
Assessing ChatGPT 4.0's Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis.

Sci Rep. 2025-4-15

[7]
Large Language Models in Biochemistry Education: Comparative Evaluation of Performance.

JMIR Med Educ. 2025-4-10

[8]
ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review.

Med Sci Educ. 2024-11-13

[9]
Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study.

JMIR Med Educ. 2025-3-19

[10]
Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care.

Front Dent Med. 2025-1-6

References

[1]
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.

JMIR Med Educ. 2024-2-21

[2]
ChatGPT Performs on the Chinese National Medical Licensing Examination.

J Med Syst. 2023-8-15

[3]
Practical Applications of ChatGPT in Undergraduate Medical Education.

J Med Educ Curric Dev. 2023-5-24

[4]
Artificial intelligence and anaesthesia examinations: exploring ChatGPT as a prelude to the future.

Br J Anaesth. 2023-8

[5]
Performance of ChatGPT on the pharmacist licensing examination in Taiwan.

J Chin Med Assoc. 2023-7-1

[6]
Chat Generative Pretrained Transformer Fails the Multiple-Choice American College of Gastroenterology Self-Assessment Test.

Am J Gastroenterol. 2023-12-1

[7]
Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations.

Radiology. 2023-6

[8]
Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams.

Eye (Lond). 2023-12

[9]
Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations.

JMIR Med Educ. 2023-4-26

[10]
ChatGPT - Reshaping medical education and clinical management.

Pak J Med Sci. 2023
