Acad Med. 2024 Feb 1;99(2):192-197. doi: 10.1097/ACM.0000000000005549. Epub 2023 Nov 7.
In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media coverage suggested that ChatGPT possesses credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on these sample items can generalize to performance on an actual USMLE examination, with an illustration using ChatGPT.
As with earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) 3 times to evaluate stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023.
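The repeated-submission design described above can be sketched as follows. This is a minimal illustration, not the study's actual analysis code: the item, the three captured responses, and the answer key are hypothetical stand-ins.

```python
def score_replications(responses, key):
    """Score repeated submissions of one multiple-choice item.

    responses: the letter choices the model returned across replications
               (3 per item in this design).
    key: the correct letter choice for the item.
    Returns (per-replication correctness, whether success was unstable).
    """
    correct = [r == key for r in responses]
    # An item counts as unstable when the replications disagree.
    unstable = len(set(responses)) > 1
    return correct, unstable

# Hypothetical item: two replications correct, one incorrect.
correct, unstable = score_replications(["B", "B", "D"], key="B")
# correct == [True, True, False]; unstable == True
```

Aggregating `unstable` over all items yields the count of items whose response success varied across replications.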
For the full sample of items, ChatGPT scored above 60% correct except for one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty wherein ChatGPT was more likely to respond correctly to items found easier by examinees. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning.
Achieving 60% accuracy is only an approximate indicator of meeting the passing standard; a direct comparison would require statistical adjustments. Hence, this assessment can only suggest consistency with the passing standards for Steps 1 and 2 Clinical Knowledge, with further limitations in extrapolating this inference to Step 3. These limitations stem from variation in item difficulty and from the exclusion of the Step 3 simulation component from the evaluation, and they would apply to any AI system evaluated on the Step 3 sample items. It is crucial to note that responses from large language models vary notably under repeated inquiry, underscoring the need for expert validation to ensure their utility as a learning tool.