Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments.

Affiliations

Guy's Hospital, Guy's and St Thomas' NHS Foundation Trust, Great Maze Pond, London, United Kingdom.

Basel, Switzerland.

Publication information

PLoS One. 2024 Jul 31;19(7):e0307372. doi: 10.1371/journal.pone.0307372. eCollection 2024.

Abstract

OBJECTIVES

As a large language model (LLM) trained on a large data set, ChatGPT can perform a wide array of tasks without additional training. We evaluated the performance of ChatGPT on UK postgraduate medical examinations through a systematic literature review of its performance in these assessments, and by testing it directly on the Membership of the Royal Colleges of Physicians (MRCP) Part 1 examination.

METHODS

Medline, Embase and Cochrane databases were searched. Articles discussing the performance of ChatGPT in UK postgraduate medical examinations were included in the systematic review. Information on exam performance, including percentage scores and pass/fail rates, was extracted. MRCP UK Part 1 sample paper questions were entered into ChatGPT-3.5 and -4 four times each, and the responses were marked against the provided correct answers.

RESULTS

Twelve studies were ultimately included in the systematic literature review. ChatGPT-3.5 scored 66.4% and ChatGPT-4 scored 84.8% on the MRCP Part 1 sample paper, 4.4% and 22.8% above the historical pass mark respectively. Both ChatGPT-3.5 and -4 performed significantly above the historical pass mark for MRCP Part 1, indicating that both would likely pass this examination. ChatGPT-3.5 failed eight of the nine postgraduate exams it attempted, scoring on average 5.0% below the pass mark. ChatGPT-4 passed nine of the eleven postgraduate exams it attempted, scoring on average 13.56% above the pass mark. ChatGPT-4 performed significantly better than ChatGPT-3.5 on every examination on which both models were tested.
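The reported margins can be sanity-checked against the historical pass mark they imply; a minimal sketch, assuming a pass mark of 62% (the value consistent with both 66.4% − 4.4% and 84.8% − 22.8%):

```python
# Check the MRCP Part 1 margins reported above.
# PASS_MARK is an assumption inferred from the stated figures, not a value
# quoted directly in the abstract.
PASS_MARK = 62.0

scores = {"ChatGPT-3.5": 66.4, "ChatGPT-4": 84.8}

# Margin of each model's score over the assumed historical pass mark.
margins = {model: round(score - PASS_MARK, 1) for model, score in scores.items()}
print(margins)
```

Both margins come out positive, matching the abstract's claim that each model scored above the historical pass mark.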

CONCLUSION

ChatGPT-4 performed above the passing level for the majority of UK postgraduate medical examinations it was tested on. However, ChatGPT is prone to hallucinations, fabrications and reduced explanation accuracy, which could limit its potential as a learning tool. The potential for these errors is inherent to LLMs and may always be a limitation for medical applications of ChatGPT.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b78d/11290618/185e2bb9d736/pone.0307372.g001.jpg
