Rizzo Michael G, Cai Nathan, Constantinescu David
University of Miami Hospital, Department of Orthopaedic Surgery, 1611 NW 12th Ave #303, Miami, FL, 33136, USA.
The University of Miami Leonard M. Miller School of Medicine, Department of Education, 1600 NW 10th Ave #1140, Miami, FL, 33136, USA.
J Orthop. 2023 Nov 23;50:70-75. doi: 10.1016/j.jor.2023.11.056. eCollection 2024 Apr.
INTRODUCTION: The rapid advancement of artificial intelligence (AI), particularly the development of Large Language Models (LLMs) such as Generative Pre-trained Transformers (GPTs), has revolutionized numerous fields. The purpose of this study was to investigate the application of LLMs to orthopaedic in-training examinations.
METHODS: Questions from the 2020-2022 Orthopaedic In-Service Training Exams (OITEs) were given to OpenAI's GPT-3.5 Turbo and GPT-4 LLMs using a zero-shot inference approach. Each model was given each multiple-choice question without prior exposure to similar queries, and its generated response was compared to the correct answer within each OITE. The models were evaluated on overall accuracy, performance on questions with and without associated media, and performance on first- and higher-order questions.
RESULTS: The GPT-4 model outperformed the GPT-3.5 Turbo model across all years and question categories (2022: 67.63% vs. 50.24%; 2021: 58.69% vs. 47.42%; 2020: 59.53% vs. 46.51%). Both models performed better on questions without associated media, with GPT-4 attaining accuracies of 68.80%, 65.14%, and 68.22% for 2022, 2021, and 2020, respectively. GPT-4 outscored GPT-3.5 Turbo on first-order questions across all years (2022: 63.83% vs. 38.30%; 2021: 57.45% vs. 50.00%; 2020: 65.74% vs. 53.70%) and on higher-order questions across all years (2022: 68.75% vs. 53.75%; 2021: 59.66% vs. 45.38%; 2020: 53.27% vs. 39.25%).
DISCUSSION: GPT-4 showed improved performance compared with GPT-3.5 Turbo in all tested categories. These results reflect both the potential and the limitations of AI in orthopaedics. GPT-4's performance is comparable to that of a second- to third-year resident and GPT-3.5 Turbo's to that of a first-year resident, suggesting that current LLMs can neither pass the OITE nor substitute for orthopaedic training. This study sets a precedent for future efforts to integrate GPT models into orthopaedic education and underscores the need for specialized training of these models on specific medical domains.
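For readers unfamiliar with the zero-shot setup described in METHODS, the following is a minimal sketch (not the authors' actual code) of how a single OITE-style multiple-choice question could be posed to GPT-3.5 Turbo and GPT-4 through the official OpenAI Python client (v1.x). The question stem, answer options, and the instruction wording are hypothetical placeholders; only the model names and API calls reflect the publicly documented interface.

```python
# Minimal sketch, assuming the `openai` Python package (v1.x) is installed and
# OPENAI_API_KEY is set in the environment. Illustrates zero-shot inference:
# each question is sent alone, with no example questions or prior context.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical OITE-style item; the real study used questions from the 2020-2022 OITEs.
QUESTION = (
    "Which structure is most at risk during a volar approach to the distal radius?\n"
    "A) Option one\nB) Option two\nC) Option three\nD) Option four"
)

def ask_zero_shot(model: str, question: str) -> str:
    """Send one multiple-choice question with no examples (zero-shot) and return the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": question + "\nAnswer with the single best choice (A-D).",
            }
        ],
        temperature=0,  # reduce randomness so responses can be scored consistently
    )
    return response.choices[0].message.content

# Query both models on the same item; scoring would compare each reply to the answer key.
for model_name in ("gpt-3.5-turbo", "gpt-4"):
    print(model_name, "->", ask_zero_shot(model_name, QUESTION))
```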