Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians University Munich, Ziemssenstrasse 5, 80336, Munich, Germany.
Department of Otolaryngology, Head and Neck Surgery, School of Medicine, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany.
Ann Biomed Eng. 2024 Jun;52(6):1542-1545. doi: 10.1007/s10439-023-03338-3. Epub 2023 Aug 8.
The use of AI-powered technology, particularly OpenAI's ChatGPT, holds significant potential to reshape healthcare and medical education. Despite existing studies on the performance of ChatGPT in medical licensing examinations across different nations, a comprehensive, multinational analysis using rigorous methodology is currently lacking. Our study sought to address this gap by evaluating the performance of ChatGPT on six different national medical licensing exams and investigating the relationship between test question length and ChatGPT's accuracy.
We manually entered a total of 1,800 test questions (300 each from the US, Italian, French, Spanish, UK, and Indian medical licensing examinations) into ChatGPT and recorded the accuracy of its responses.
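The study entered questions by hand; purely for illustration, the minimal Python sketch below shows how such an evaluation loop could be scripted against the OpenAI Chat Completions API. The model name, question schema, and answer-matching rule are assumptions for the sketch, not the study's protocol.

```python
# Illustrative only: the study entered questions manually into ChatGPT.
# This sketch assumes multiple-choice questions with a known answer key.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question format; actual exam items differ.
questions = [
    {"text": "Which vitamin deficiency causes scurvy? A) A  B) B12  C) C  D) D",
     "answer": "C"},
]

correct = 0
for q in questions:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the study used the ChatGPT interface
        messages=[{"role": "user",
                   "content": q["text"] + "\nAnswer with the letter only."}],
    )
    reply = (response.choices[0].message.content or "").strip().upper()
    # Naive answer matching: compare the first character to the key.
    if reply[:1] == q["answer"]:
        correct += 1

print(f"Accuracy: {correct / len(questions):.0%}")
```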
We found significant variation in ChatGPT's accuracy across the different countries, with the highest accuracy in the Italian examination (73% correct answers) and the lowest in the French examination (22% correct answers). Interestingly, question length correlated with ChatGPT's performance only in the Italian and French state examinations. In addition, questions requiring multiple correct answers, as in the French examination, posed a greater challenge to ChatGPT.
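The abstract does not name the statistic used; one plausible way to test a length-accuracy relationship is a point-biserial correlation between per-question word count and a 0/1 correctness flag, sketched below with fabricated placeholder data (the variable names and values are assumptions, not study data).

```python
# Illustrative sketch: point-biserial correlation between question length
# and correctness. The data below are placeholders, not the study's data.
from scipy.stats import pointbiserialr

word_counts = [35, 120, 48, 210, 75, 160, 90, 55]  # question lengths (words)
correct = [1, 0, 1, 0, 1, 0, 1, 1]                 # 1 = ChatGPT answered correctly

r, p = pointbiserialr(correct, word_counts)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
# A significant r would indicate a length-accuracy relationship, which the
# study reports only for the Italian and French examinations.
```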
Our findings underscore the need for future research to further delineate ChatGPT's strengths and limitations in medical test-taking across additional countries and to develop guidelines to prevent AI-assisted cheating in medical examinations.