Department of Oral and Maxillofacial Surgery, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany.
Department of Plastic Surgery and Hand Surgery, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany.
Sci Rep. 2024 Jun 12;14(1):13553. doi: 10.1038/s41598-024-63997-7.
ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, a substantial knowledge gap remains in understanding the opportunities and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After exclusion of 352 image-based questions, the remaining 2,377 text-based questions were categorized, entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed by question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy of 55.8% on the n = 2,377 USMLE Step 1 preparation questions obtained from the Amboss online question bank. Its performance showed a significant inverse correlation with question difficulty (r = -0.306; p < 0.001), while maintaining accuracy comparable to the human user peer group across difficulty levels. Notably, ChatGPT performed better on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT also performed significantly worse on pathophysiology-related question stems (signal phrase: "what is the most likely/probable cause"). Overall, ChatGPT performed consistently across question categories and difficulty levels. These findings emphasize the need for further investigation of the potential and limitations of ChatGPT in medical examination and education.
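The statistics reported above (an overall accuracy rate, a difficulty-performance correlation, and category-wise accuracy comparisons) follow a standard analysis pattern. The Python sketch below is a minimal illustration of that pattern, not the authors' code: the column names (difficulty, correct, category) and the toy data are hypothetical stand-ins for the Amboss question records described in the abstract.

    # Minimal sketch (assumed record format, not the authors' pipeline):
    # one row per question answered by ChatGPT.
    import pandas as pd
    from scipy.stats import pearsonr, chi2_contingency

    df = pd.DataFrame({
        "difficulty": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],   # Amboss-style 1-5 rating
        "correct":    [1, 1, 0, 0, 0, 1, 0, 1, 0, 1],   # 1 = answered correctly
        "category":   ["serology", "ecg"] * 5,           # hypothetical labels
    })

    # Overall accuracy (reported as 55.8% in the study).
    print(f"Overall accuracy: {df['correct'].mean():.1%}")

    # Correlation between difficulty and correctness
    # (the study reports r = -0.306, p < 0.001 over n = 2,377 questions).
    r, p = pearsonr(df["difficulty"], df["correct"])
    print(f"r = {r:.3f}, p = {p:.3f}")

    # Chi-square test comparing accuracy between two content categories,
    # analogous to the serology vs. ECG comparisons in the abstract.
    table = pd.crosstab(df["category"], df["correct"])
    chi2, p_cat, _, _ = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p_cat:.3f}")

On real data of this size, the Pearson correlation between a 1-5 difficulty rating and a binary correctness indicator is equivalent to a point-biserial correlation, which matches the r-and-p reporting style used in the abstract.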