
A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination?

Author Information

Nakajima Nozomu, Fujimori Takahito, Furuya Masayuki, Kanie Yuya, Imai Hirotatsu, Kita Kosuke, Uemura Keisuke, Okada Seiji

Affiliations

Orthopaedics, Sakai City Medical Center, Sakai, JPN.

Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN.

Publication Information

Cureus. 2024 Mar 18;16(3):e56402. doi: 10.7759/cureus.56402. eCollection 2024 Mar.

Abstract

Introduction: Large language models such as ChatGPT (OpenAI, San Francisco, CA) have evolved rapidly. These models are designed to reason and respond like humans and possess a broad range of specialized knowledge. GPT-3.5 was reported to perform at a passing level on the United States Medical Licensing Examination. Its capabilities continue to evolve, and in October 2023, GPT-4V became available as a model capable of image recognition. Because these models may soon be incorporated into medical practice, it is important to know their current performance. We aimed to evaluate the performance of ChatGPT in the field of orthopedic surgery.

Methods: We used three years of the Japanese Board of Orthopaedic Surgery Examination (JBOSE), conducted in 2021, 2022, and 2023. The questions, their multiple-choice answers, and the official examination rubric were used in their original Japanese form. We input these questions into three versions of ChatGPT: GPT-3.5, GPT-4, and GPT-4V. For image-based questions, we input only the text of the question for GPT-3.5 and GPT-4, and both the image and the text for GPT-4V. Because the minimum score required to pass is not officially disclosed, it was estimated from publicly available data.

Results: The estimated minimum score required to pass was 50.1% (43.7-53.8%). GPT-4 answered 59% (55-61%) of all questions correctly, including the image-based ones, and reached the passing line; when image-based questions were excluded, its score rose to 67% (63-73%). GPT-3.5 answered only 30% (28-32%) correctly and could not pass the examination. The difference in performance between GPT-4 and GPT-3.5 was significant (p < 0.001). On image-based questions, the proportion of correct answers was 25% for GPT-3.5, 38% for GPT-4, and 38% for GPT-4V, with no significant difference between GPT-4 and GPT-4V.

Conclusions: ChatGPT performed well enough to pass the orthopedic specialist examination. With further training data such as images, ChatGPT is expected to find applications in the orthopedics field.
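The abstract does not state which statistical test produced the p < 0.001 comparison between GPT-4 and GPT-3.5. As an illustration only, the sketch below applies a standard pooled two-proportion z-test to hypothetical question counts (300 items per model, chosen to match the reported ~59% vs. ~30% accuracy; the actual JBOSE item counts are not given here), showing how such a difference in proportions can be assessed:

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int):
    """Pooled two-proportion z-test.

    k1/n1 and k2/n2 are correct answers out of total questions for the
    two models. Returns (z, two-sided p-value), with the p-value taken
    from the normal approximation: P(|Z| > z) = erfc(|z| / sqrt(2)).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical counts (NOT from the paper): ~59% vs ~30% of 300 items.
z, p = two_proportion_z(177, 300, 90, 300)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With proportions this far apart at a few hundred items each, the test easily clears the p < 0.001 threshold, which is consistent with the significance level the abstract reports.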


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bf65/11023708/9c505ebccfe1/cureus-0016-00000056402-i01.jpg
