

The performance of ChatGPT on orthopaedic in-service training exams: A comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education.

Authors

Rizzo Michael G, Cai Nathan, Constantinescu David

Affiliations

University of Miami Hospital, Department of Orthopaedic Surgery, 1611 NW 12th Ave #303, Miami, FL, 33136, USA.

The University of Miami Leonard M. Miller School of Medicine, Department of Education, 1600 NW 10th Ave #1140, Miami, FL, 33136, USA.

Publication Information

J Orthop. 2023 Nov 23;50:70-75. doi: 10.1016/j.jor.2023.11.056. eCollection 2024 Apr.

Abstract

INTRODUCTION

The rapid advancement of artificial intelligence (AI), particularly the development of Large Language Models (LLMs) such as Generative Pretrained Transformers (GPTs), has revolutionized numerous fields. The purpose of this study is to investigate the application of LLMs to orthopaedic in-training examinations.

METHODS

Questions from the 2020-2022 Orthopaedic In-Service Training Exams (OITEs) were given to OpenAI's GPT-3.5 Turbo and GPT-4 LLMs using a zero-shot inference approach. Each model received each multiple-choice question without prior exposure to similar queries, and its generated response was compared with the correct answer for that OITE. The models were evaluated on overall accuracy, on performance on questions with and without associated media, and on performance on first- and higher-order questions.
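
The abstract does not specify the exact prompts or API calls used; purely as an illustration of a zero-shot multiple-choice query, a minimal sketch using the OpenAI Python SDK might look like the following (the prompt wording, the ask_oite_question helper, and the temperature setting are assumptions, not the authors' code):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_oite_question(model, stem, choices):
    """Send one multiple-choice question with no prior examples (zero-shot)."""
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    prompt = (
        "Answer the following orthopaedic multiple-choice question. "
        "Reply with the single letter of the best answer.\n\n"
        f"{stem}\n\n{options}"
    )
    response = client.chat.completions.create(
        model=model,  # "gpt-3.5-turbo" or "gpt-4"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumption: deterministic decoding to simplify grading
    )
    return response.choices[0].message.content.strip()

Each returned letter would then be scored against the official answer key to produce the accuracy figures reported below.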

RESULTS

The GPT-4 model outperformed the GPT-3.5 Turbo model across all years and question categories (2022: 67.63% vs. 50.24%; 2021: 58.69% vs. 47.42%; 2020: 59.53% vs. 46.51%). Both models showcased better performance with questions devoid of associated media, with GPT-4 attaining accuracies of 68.80%, 65.14%, and 68.22% for 2022, 2021, and 2020, respectively. GPT-4 outscored GPT-3.5 Turbo on first-order questions across all years (2022: 63.83% vs. 38.30%; 2021: 57.45% vs. 50.00%; 2020: 65.74% vs. 53.70%). GPT-4 also outscored GPT-3.5 Turbo on higher-order questions across all years (2022: 68.75% vs. 53.75%; 2021: 59.66% vs. 45.38%; 2020: 53.27% vs. 39.25%).
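
Restating the overall accuracies above as year-by-year percentage-point gaps makes the comparison concrete; the short snippet below only re-derives those gaps from the figures reported in this abstract (variable names are illustrative):

overall = {
    2022: {"gpt-4": 67.63, "gpt-3.5-turbo": 50.24},
    2021: {"gpt-4": 58.69, "gpt-3.5-turbo": 47.42},
    2020: {"gpt-4": 59.53, "gpt-3.5-turbo": 46.51},
}
for year in sorted(overall, reverse=True):
    gap = overall[year]["gpt-4"] - overall[year]["gpt-3.5-turbo"]
    print(f"{year}: GPT-4 leads GPT-3.5 Turbo by {gap:.2f} percentage points")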

DISCUSSION

GPT-4 showed improved performance compared to GPT-3.5 Turbo in all tested categories. The results reflect both the potential and the limitations of AI in orthopaedics. GPT-4's performance is comparable to that of a second- to third-year resident and GPT-3.5 Turbo's to that of a first-year resident, suggesting that current LLMs can neither pass the OITE nor substitute for orthopaedic training. This study sets a precedent for future efforts to integrate GPT models into orthopaedic education and underlines the necessity of specialized training of these models for specific medical domains.

Similar Articles

The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions. Orthopedics. 2024 Mar-Apr;47(2):e85-e89. doi: 10.3928/01477447-20230922-05. Epub 2023 Sep 27.
Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination. J Orthop Surg (Hong Kong). 2025 Jan-Apr;33(1):10225536241268789. doi: 10.1177/10225536241268789.
Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4.

Cited By

Systematic Review on Large Language Models in Orthopaedic Surgery. J Clin Med. 2025 Aug 20;14(16):5876. doi: 10.3390/jcm14165876.
Evaluating the accuracy of CHATGPT models in answering multiple-choice questions on oral and maxillofacial pathologies and oral radiology. Digit Health. 2025 Jul 8;11:20552076251355847. doi: 10.1177/20552076251355847. eCollection 2025 Jan-Dec.
Answering Patterns in SBA Items: Students, GPT3.5, and Gemini. Med Sci Educ. 2024 Nov 26;35(2):629-632. doi: 10.1007/s40670-024-02232-4. eCollection 2025 Apr.
Exploring the role of artificial intelligence in Turkish orthopedic progression exams. Acta Orthop Traumatol Turc. 2025 Mar 17;59(1):18-26. doi: 10.5152/j.aott.2025.24090.
Semantic Clinical Artificial Intelligence vs Native Large Language Model Performance on the USMLE. JAMA Netw Open. 2025 Apr 1;8(4):e256359. doi: 10.1001/jamanetworkopen.2025.6359.
ChatGPT 4.0's efficacy in the self-diagnosis of non-traumatic hand conditions. J Hand Microsurg. 2025 Jan 23;17(3):100217. doi: 10.1016/j.jham.2025.100217. eCollection 2025 May.
