Jaworski Aleksander, Jasiński Dawid, Sławińska Barbara, Błecha Zuzanna, Jaworski Wojciech, Kruplewicz Maja, Jasińska Natalia, Sysło Oliwia, Latkowska Ada, Jung Magdalena
Department of Plastic Surgery, Specialist Medical Center, Polanica-Zdrój, POL.
Department of Medicine, Prof. K. Gibiński University Clinical Center of the Medical University of Silesia in Katowice, Katowice, POL.
Cureus. 2024 Sep 6;16(9):e68813. doi: 10.7759/cureus.68813. eCollection 2024 Sep.
Background: This study aims to evaluate the performance of OpenAI's GPT-4o on the Polish Final Dentistry Examination (LDEK) and compare it with human candidates' results. The LDEK is a standardized test that dental graduates in Poland must pass to obtain their professional license. With artificial intelligence (AI) becoming increasingly integrated into medical and dental education, it is important to assess AI's capabilities in such high-stakes examinations.

Materials and methods: The study was conducted from August 1 to August 15, 2024, using the Spring 2023 LDEK exam. The exam comprised 200 multiple-choice questions, each with one correct answer among five options. Questions spanned various dental disciplines, including Conservative Dentistry with Endodontics, Pediatric Dentistry, Dental Surgery, Prosthetic Dentistry, Periodontology, Orthodontics, Emergency Medicine, Bioethics and Medical Law, Medical Certification, and Public Health. One question was withdrawn by the exam organizers, leaving 199 valid questions. GPT-4o was tested on these questions without access to the publicly available question bank. The AI model's responses were recorded, and the confidence level of each answer was assessed. Correct answers were determined based on the official key provided by the Center for Medical Education (CEM) in Łódź, Poland. Statistical analyses, including Pearson's chi-square test and the Mann-Whitney U test, were performed to evaluate the accuracy and confidence of ChatGPT's answers across different dental fields.

Results: GPT-4o correctly answered 141 of the 199 valid questions (70.85%) and answered 58 incorrectly (29.15%). The AI performed better in fields such as Conservative Dentistry with Endodontics (71.74%) and Prosthetic Dentistry (80%) but showed lower accuracy in Pediatric Dentistry (62.07%) and Orthodontics (52.63%). A statistically significant difference was observed between ChatGPT's performance on clinical case-based questions (36.36% accuracy) and on the remaining factual questions (72.87% accuracy), with a p-value of 0.025. Confidence levels also differed significantly between correct and incorrect answers, with a p-value of 0.0208.

Conclusions: GPT-4o's performance on the LDEK suggests it has potential as a supplementary educational tool in dentistry. However, the AI's limited clinical reasoning ability, especially in complex scenarios, reveals a substantial gap between AI and human expertise. While ChatGPT demonstrates strong performance in factual recall, it cannot yet match the critical thinking and clinical judgment exhibited by human candidates.
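The two comparisons reported above (clinical case-based vs. factual accuracy, and confidence for correct vs. incorrect answers) can be illustrated with a short statistical sketch in Python/SciPy. This is not the authors' analysis code: the 2x2 cell counts are back-calculated from the reported percentages (36.36% of an assumed 11 clinical case-based questions is roughly 4 correct; 72.87% of the remaining 188 questions is roughly 137 correct), and the confidence ratings are purely hypothetical placeholders used only to show the workflow.

    from scipy.stats import chi2_contingency, mannwhitneyu

    # 2x2 contingency table: rows = question type, columns = (correct, incorrect).
    # Counts are back-calculated from the abstract's percentages, not taken from
    # the paper's raw data.
    table = [
        [4, 7],      # clinical case-based questions (assumed 11 in total)
        [137, 51],   # other, factual questions (assumed 188 in total)
    ]

    # chi2_contingency applies Yates' continuity correction to 2x2 tables by
    # default; with these counts the p-value comes out close to the reported 0.025.
    chi2, p_chi2, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p_chi2:.3f}")

    # Mann-Whitney U test on confidence for correct vs. incorrect answers.
    # The ratings below are hypothetical placeholders on a 1-5 scale; the
    # abstract reports p = 0.0208 for the authors' actual confidence data.
    confidence_correct = [5, 4, 5, 3, 4, 5, 4]
    confidence_incorrect = [3, 2, 4, 3, 2, 3]
    u_stat, p_u = mannwhitneyu(confidence_correct, confidence_incorrect,
                               alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3f}")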