Department of Orthopedic Surgery, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea.
Clin Orthop Surg. 2024 Aug;16(4):669-673. doi: 10.4055/cios23179. Epub 2024 Mar 7.
The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions.
A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, the inconsistency rates of the answers were evaluated by regenerating the responses.
ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for the tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4.
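The abstract does not name the statistical test behind the reported p-value. As a rough, illustrative check only, a pooled two-proportion z-test on the reported accuracy rates (60/160 vs. 96/160 correct) reproduces a p-value below 0.001; note this is an assumption for illustration, and the paired structure of the data (both models answered the same 160 questions) would more properly call for a paired test such as McNemar's.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-statistic for the difference in accuracy."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Reported results: ChatGPT 3.5 answered 60/160 (37.5%) correctly,
# ChatGPT 4 answered 96/160 (60.0%) correctly.
z = two_proportion_z(60, 160, 96, 160)

# Two-sided p-value from the standard normal tail, via the
# complementary error function: p = erfc(z / sqrt(2)).
p_value = math.erfc(z / math.sqrt(2))
```

Under this (assumed) unpaired test, the difference is significant at well below the 0.001 level, consistent with the reported result.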
ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy rate. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations require caution when applying ChatGPT in clinical settings or for educational purposes.