Suppr 超能文献



Analyzing Question Characteristics Influencing ChatGPT's Performance in 3000 USMLE®-Style Questions.

Author Information

Alfertshofer Michael, Knoedler Samuel, Hoch Cosima C, Cotofana Sebastian, Panayi Adriana C, Kauke-Navarro Martin, Tullius Stefan G, Orgill Dennis P, Austen William G, Pomahac Bohdan, Knoedler Leonard

Affiliations

Department of Oral and Maxillofacial Surgery, Ludwig-Maximilians-University Munich, Munich, Germany.

Department of Plastic Surgery and Hand Surgery, Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany.

Publication Information

Med Sci Educ. 2024 Sep 28;35(1):257-267. doi: 10.1007/s40670-024-02176-9. eCollection 2025 Feb.

DOI: 10.1007/s40670-024-02176-9
PMID: 40144074
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11933601/
Abstract

BACKGROUND

The potential of artificial intelligence (AI) and large language models like ChatGPT in medical applications is promising, yet its performance requires comprehensive evaluation. This study assessed ChatGPT's capabilities in answering USMLE® Step 2CK questions, analyzing its performance across medical specialties, question types, and difficulty levels in a large-scale question test set to assist question writers in developing AI-resistant exam questions and provide medical students with a realistic understanding of how AI can enhance their active learning.

MATERIALS AND METHODS

A total of 3302 USMLE® Step 2CK practice questions were extracted from the AMBOSS© study platform; 302 image-based questions were excluded, leaving 3000 text-based questions for analysis. Questions were manually entered into ChatGPT, and its accuracy and performance across various categories and difficulty levels were evaluated.

RESULTS

ChatGPT answered 57.7% of all questions correctly. The highest performance was found in the category "Male Reproductive System" (71.7%), while the lowest was found in the category "Immune System" (46.3%). Lower performance was noted in table-based questions, and a negative correlation was found between question difficulty and performance (r = -0.285, p < 0.001). Longer questions tended to be answered incorrectly more often (r = -0.076, p < 0.001), with a significant difference in length between correctly and incorrectly answered questions.
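The correlations reported above relate a continuous question property (difficulty, length) to a binary correct/incorrect outcome, which is a point-biserial correlation — numerically identical to the plain Pearson r with the outcome coded 0/1. The sketch below illustrates both the per-category accuracy tally and that correlation on made-up toy data; the categories, difficulty scale, and values are placeholders, not the study's dataset.

```python
from collections import defaultdict
from math import sqrt

def accuracy_by_category(results):
    """results: iterable of (category, correct) pairs -> {category: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [n_correct, n_total]
    for category, correct in results:
        totals[category][0] += int(correct)
        totals[category][1] += 1
    return {c: right / n for c, (right, n) in totals.items()}

def pearson_r(xs, ys):
    """Plain Pearson correlation; with a 0/1-coded y this equals the point-biserial r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: (category, difficulty on a 1-5 scale, answered correctly)
data = [
    ("Male Reproductive System", 1, True),
    ("Male Reproductive System", 2, True),
    ("Male Reproductive System", 4, False),
    ("Immune System", 3, True),
    ("Immune System", 4, False),
    ("Immune System", 5, False),
]

acc = accuracy_by_category((c, ok) for c, _, ok in data)
# Harder questions are answered wrong more often here, so r comes out negative.
r = pearson_r([d for _, d, _ in data], [int(ok) for _, _, ok in data])
```

In practice such an analysis would be run over the full 3000-question table (e.g. with `scipy.stats.pointbiserialr`), but the arithmetic reduces to exactly this.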

CONCLUSION

ChatGPT demonstrated proficiency close to the passing threshold for USMLE® Step 2CK. Performance varied by category, question type, and difficulty. These findings help medical educators make their exams more AI-resistant and inform the integration of AI tools like ChatGPT into teaching strategies. For students, understanding the model's limitations and capabilities ensures it is used as an auxiliary resource to foster active learning rather than misused as a study replacement. This study highlights the need for further refinement and improvement of AI models for medical education and decision-making.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/27c5816b6f70/40670_2024_2176_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/9f61bbfffa94/40670_2024_2176_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/9ef91698baed/40670_2024_2176_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/857b17388e94/40670_2024_2176_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/f9f322f25d70/40670_2024_2176_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/3113b9350d9a/40670_2024_2176_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b8a/11933601/4c00713156b4/40670_2024_2176_Fig7_HTML.jpg

Similar Articles

1
Analyzing Question Characteristics Influencing ChatGPT's Performance in 3000 USMLE®-Style Questions.
Med Sci Educ. 2024 Sep 28;35(1):257-267. doi: 10.1007/s40670-024-02176-9. eCollection 2025 Feb.
2
Advancements in AI Medical Education: Assessing ChatGPT's Performance on USMLE-Style Questions Across Topics and Difficulty Levels.
Cureus. 2024 Dec 24;16(12):e76309. doi: 10.7759/cureus.76309. eCollection 2024 Dec.
3
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
4
In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions.
Sci Rep. 2024 Jun 12;14(1):13553. doi: 10.1038/s41598-024-63997-7.
5
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
6
ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination.
Med Teach. 2024 Mar;46(3):366-372. doi: 10.1080/0142159X.2023.2249588. Epub 2023 Oct 15.
7
ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice.
Front Med (Lausanne). 2023 Dec 13;10:1296615. doi: 10.3389/fmed.2023.1296615. eCollection 2023.
8
Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study.
JMIR Med Educ. 2024 Jan 18;10:e50842. doi: 10.2196/50842.
9
Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.
Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. Epub 2024 Feb 8.
10
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.
J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
