Kung Justin E, Marshall Christopher, Gauthier Chase, Gonzalez Tyler A, Jackson J Benjamin
Department of Orthopedic Surgery, Prisma Health-Midlands/University of South Carolina, Columbia, South Carolina.
University of South Carolina School of Medicine, Columbia, South Carolina.
JBJS Open Access. 2023 Sep 8;8(3). doi: 10.2106/JBJS.OA.23.00056. eCollection 2023 Jul-Sep.
Artificial intelligence (AI) holds potential for improving medical education and healthcare delivery. ChatGPT is a state-of-the-art natural language processing AI model that has demonstrated impressive capabilities, scoring in the top percentiles on numerous standardized examinations, including the Uniform Bar Exam and the Scholastic Aptitude Test. The goal of this study was to evaluate ChatGPT's performance on the Orthopaedic In-Training Examination (OITE), an assessment of the medical knowledge of orthopedic surgery residents.
OITE 2020, 2021, and 2022 questions without images were input into ChatGPT version 3.5 and version 4 (GPT-4) with zero prompting. ChatGPT's performance was evaluated as the percentage of correct responses and compared with the national average of orthopedic surgery residents at each postgraduate year (PGY) level. ChatGPT was asked to provide a source for each answer; sources were categorized as journal articles, books, or websites, and as verifiable or not. The impact factor of each cited journal was also recorded.
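The abstract does not specify how questions were submitted to the models. Purely for illustration, a zero-shot evaluation of this kind could be scripted against the OpenAI chat completions API as in the minimal sketch below; the model name, question format, and letter-matching grader are hypothetical assumptions, not the authors' method (a real study would grade responses manually).

```python
# Illustrative sketch only: zero-shot submission of multiple-choice
# questions, with no system instructions and no examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_model(question_text: str, model: str = "gpt-4") -> str:
    """Submit one question with zero prompting and return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content


def percent_correct(questions: list[dict], model: str) -> float:
    """Score items shaped like {'text': ..., 'answer': 'B'}.

    The containment check below is a crude stand-in for the manual
    grading an actual study would require.
    """
    correct = sum(
        1 for q in questions if q["answer"] in ask_model(q["text"], model)
    )
    return 100 * correct / len(questions)
```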
ChatGPT answered 196 of 360 questions correctly (54.3%), corresponding to the performance of a PGY-1. ChatGPT cited a verifiable source for 47.2% of questions, with an average journal impact factor of 5.4. GPT-4 answered 265 of 360 questions correctly (73.6%), corresponding to the average performance of a PGY-5 and exceeding the 67% score that corresponds to passing the American Board of Orthopaedic Surgery Part I Examination. GPT-4 cited a verifiable source for 87.9% of questions, with an average journal impact factor of 5.2.
ChatGPT performed above the level of the average PGY-1, and GPT-4 performed better than the average PGY-5, demonstrating a marked improvement between versions. Further investigation is needed to determine how successive versions of ChatGPT will perform and how this technology can be optimized to improve medical education.
AI has the potential to aid in medical education and healthcare delivery.