

AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study.

Author Affiliations

Misr University for Science and Technology, 6th of October, Egypt.

Medical Research Platform (MRP), Giza, Egypt.

Publication Information

Sci Rep. 2024 Aug 14;14(1):18859. doi: 10.1038/s41598-024-68996-2.

Abstract

Large language models (LLMs) like ChatGPT have potential applications in medical education, such as helping students study for their licensing exams by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs performed on simulated UK medical board exam questions. A total of 423 board-style questions from 9 UK exams (MRCS, MRCP, etc.) were answered by seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, Claude Instant). There were 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering topics in surgery, pediatrics, and other disciplines. Responses were graded for accuracy, and statistical tests were used to analyze differences among LLMs. Leaked questions were excluded from the primary analysis. ChatGPT-4.0 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%). Perplexity scored the lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice than on true/false or "choose N" questions. LLMs demonstrated limitations in answering certain questions, indicating refinements are needed before primary reliance in medical education. However, their expanding capabilities suggest a potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and optimal integration into medical curricula.
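The abstract reports an overall significant difference between models (p < 0.001) as well as significant pairwise differences, but does not specify which statistical test was used. As an illustration only, a two-proportion z-test on correct-answer counts reconstructed from the reported accuracies and the 423-question total shows how one such pairwise comparison could be run; the counts below are a hypothetical reconstruction, not figures from the paper:

```python
import math

TOTAL = 423  # questions in the primary analysis

# Correct-answer counts reconstructed from the reported accuracies
# (hypothetical reconstruction; the paper's exact counts may differ).
correct = {
    "ChatGPT-4.0": round(0.782 * TOTAL),  # 78.2%, highest scorer
    "Perplexity":  round(0.561 * TOTAL),  # 56.1%, lowest scorer
}

def two_proportion_z_test(c1, c2, n):
    """Two-sided two-proportion z-test for two samples of equal size n."""
    p1, p2 = c1 / n, c2 / n
    pooled = (c1 + c2) / (2 * n)          # pooled success proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

z, p = two_proportion_z_test(correct["ChatGPT-4.0"], correct["Perplexity"], TOTAL)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With this reconstruction, the gap between the best and worst performers (78.2% vs 56.1% over 423 items) yields a p-value far below 0.001, consistent with the significant pairwise differences the abstract reports.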


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b698/11324724/f5ae6571e69b/41598_2024_68996_Fig1_HTML.jpg
