Guy's Hospital, Guy's and St Thomas' NHS Foundation Trust, Great Maze Pond, London, United Kingdom.
Basel, Switzerland.
PLoS One. 2024 Jul 31;19(7):e0307372. doi: 10.1371/journal.pone.0307372. eCollection 2024.
As a large language model (LLM) trained on a vast data set, ChatGPT can perform a wide array of tasks without additional training. We evaluated ChatGPT's performance on UK postgraduate medical examinations through a systematic literature review of its performance in UK postgraduate medical assessments, and tested it directly on the Membership of the Royal Colleges of Physicians (MRCP) Part 1 examination.
The Medline, Embase and Cochrane databases were searched. Articles assessing the performance of ChatGPT in UK postgraduate medical examinations were included in the systematic review. Exam performance data, including percentage scores and pass/fail outcomes, were extracted. MRCP UK Part 1 sample paper questions were entered into ChatGPT-3.5 and ChatGPT-4 four times each, and the responses were marked against the correct answers provided.
Twelve studies were ultimately included in the systematic literature review. ChatGPT-3.5 scored 66.4% and ChatGPT-4 scored 84.8% on the MRCP Part 1 sample paper, 4.4% and 22.8% above the historical pass mark respectively. Both ChatGPT-3.5 and -4 performed significantly above the historical pass mark for MRCP Part 1, indicating that they would likely pass this examination. ChatGPT-3.5 failed eight of the nine postgraduate examinations it attempted, scoring on average 5.0% below the pass mark. ChatGPT-4 passed nine of the eleven postgraduate examinations it attempted, scoring on average 13.56% above the pass mark. ChatGPT-4 performed significantly better than ChatGPT-3.5 on every examination on which both models were tested.
ChatGPT-4 performed above the passing level on the majority of UK postgraduate medical examinations on which it was tested. However, ChatGPT is prone to hallucinations, fabrications and reduced explanation accuracy, which could limit its potential as a learning tool. The potential for these errors is inherent to LLMs and may remain a limitation for medical applications of ChatGPT.