Jaworski Aleksander, Jasiński Dawid, Sławińska Barbara, Błecha Zuzanna, Jaworski Wojciech, Kruplewicz Maja, Jasińska Natalia, Sysło Oliwia, Latkowska Ada, Jung Magdalena
Department of Plastic Surgery, Specialist Medical Center, Polanica-Zdrój, POL.
Department of Medicine, Prof. K. Gibiński University Clinical Center of the Medical University of Silesia in Katowice, Katowice, POL.
Cureus. 2024 Sep 6;16(9):e68813. doi: 10.7759/cureus.68813. eCollection 2024 Sep.
Background: This study aims to evaluate the performance of OpenAI's GPT-4o on the Polish Final Dentistry Examination (LDEK) and compare it with human candidates' results. The LDEK is a standardized test that dental graduates in Poland must pass to obtain their professional license. With artificial intelligence (AI) becoming increasingly integrated into medical and dental education, it is important to assess AI's capabilities in such high-stakes examinations.

Materials and methods: The study was conducted from August 1 to August 15, 2024, using the Spring 2023 LDEK exam. The exam comprised 200 multiple-choice questions, each with one correct answer among five options. Questions spanned various dental disciplines, including Conservative Dentistry with Endodontics, Pediatric Dentistry, Dental Surgery, Prosthetic Dentistry, Periodontology, Orthodontics, Emergency Medicine, Bioethics and Medical Law, Medical Certification, and Public Health. One question was withdrawn by the exam organizers, leaving 199 valid questions. GPT-4o was tested on these questions without access to the publicly available question bank. The AI model's responses were recorded, and the confidence level of each answer was assessed. Correct answers were determined based on the official key provided by the Center for Medical Education (CEM) in Łódź, Poland. Statistical analyses, including Pearson's chi-square test and the Mann-Whitney U test, were performed to evaluate the accuracy and confidence of ChatGPT's answers across different dental fields.

Results: GPT-4o correctly answered 141 of the 199 valid questions (70.85%) and answered 58 incorrectly (29.15%). The AI performed better in fields such as Conservative Dentistry with Endodontics (71.74%) and Prosthetic Dentistry (80%) but showed lower accuracy in Pediatric Dentistry (62.07%) and Orthodontics (52.63%). A statistically significant difference was observed between ChatGPT's performance on clinical case-based questions (36.36% accuracy) and on the remaining factual questions (72.87% accuracy), with a p-value of 0.025. Confidence levels also differed significantly between correct and incorrect answers, with a p-value of 0.0208.

Conclusions: GPT-4o's performance on the LDEK suggests it has potential as a supplementary educational tool in dentistry. However, the AI's limited clinical reasoning ability, especially in complex scenarios, reveals a substantial gap between AI and human expertise. While ChatGPT demonstrates strong performance in factual recall, it cannot yet match the critical thinking and clinical judgment exhibited by human candidates.
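The two comparisons reported above (clinical case-based vs. factual accuracy, and confidence for correct vs. incorrect answers) can be illustrated with a short statistical sketch in Python/SciPy. This is not the authors' analysis code: the 2x2 cell counts are back-calculated from the reported percentages (36.36% of an assumed 11 clinical case-based questions is roughly 4 correct; 72.87% of the remaining 188 questions is roughly 137 correct), and the confidence ratings are purely hypothetical placeholders used only to show the workflow.

    from scipy.stats import chi2_contingency, mannwhitneyu

    # 2x2 contingency table: rows = question type, columns = (correct, incorrect).
    # Counts are back-calculated from the abstract's percentages, not taken from
    # the paper's raw data.
    table = [
        [4, 7],      # clinical case-based questions (assumed 11 in total)
        [137, 51],   # other, factual questions (assumed 188 in total)
    ]

    # chi2_contingency applies Yates' continuity correction to 2x2 tables by
    # default; with these counts the p-value comes out close to the reported 0.025.
    chi2, p_chi2, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p_chi2:.3f}")

    # Mann-Whitney U test on confidence for correct vs. incorrect answers.
    # The ratings below are hypothetical placeholders on a 1-5 scale; the
    # abstract reports p = 0.0208 for the authors' actual confidence data.
    confidence_correct = [5, 4, 5, 3, 4, 5, 4]
    confidence_incorrect = [3, 2, 4, 3, 2, 3]
    u_stat, p_u = mannwhitneyu(confidence_correct, confidence_incorrect,
                               alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3f}")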