How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

Author Information

Gilson Aidan, Safranek Conrad W, Huang Thomas, Socrates Vimig, Chi Ling, Taylor Richard Andrew, Chartash David

Affiliations

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.

Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States.

Publication Information

JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.

Abstract

BACKGROUND

Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.

OBJECTIVE

This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.

METHODS

We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a question bank commonly used by medical students, which also provides statistics on question difficulty and on performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) Free 120 question set. ChatGPT's performance was compared to that of 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.

RESULTS

Across the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT's answer selection was present in 100% of outputs in the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
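The per-dataset accuracies above follow directly from the reported correct/total counts. As a minimal sketch (using only the counts stated in this abstract, not the authors' actual analysis pipeline), the percentages can be reproduced as follows:

```python
# Recompute the reported accuracies from the correct/total counts
# given in the Results section. The counts below are taken verbatim
# from the abstract; the rounding convention (one decimal place) is
# assumed to match the reported figures.
counts = {
    "AMBOSS-Step1": (44, 100),
    "AMBOSS-Step2": (42, 100),
    "NBME-Free-Step1": (56, 87),
    "NBME-Free-Step2": (59, 102),
}

def accuracy(correct: int, total: int) -> float:
    """Return percent accuracy rounded to one decimal place."""
    return round(100 * correct / total, 1)

for name, (correct, total) in counts.items():
    print(f"{name}: {accuracy(correct, total)}%")
# NBME-Free-Step1 yields 64.4% and NBME-Free-Step2 yields 57.8%,
# matching the values reported above.
```

This simple check confirms the internal consistency of the reported fractions and percentages.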

CONCLUSIONS

ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above a 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context in the majority of its answers. Taken together, these findings make a compelling case for the potential application of ChatGPT as an interactive medical education tool to support learning.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4156/9947764/03383b67496f/mededu_v9i1e45312_fig1.jpg
