Shahbaz Katebzadeh, Kaci Pickett-Nairne, Paloma Reyes Nguyen, Chaitanya Prakash Puranik
Assistant Professor, Department of Pediatric Dentistry, School of Dental Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colo., USA, and Assistant Professor, Children's Hospital Colorado, Aurora, Colo., USA.
Biostatistician and Research Instructor, Center for Research Outcomes in Children's Surgery (ROCS), Center for Children's Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colo., USA.
Pediatr Dent. 2025 Mar 15;47(2):79-84.
Purpose: To determine the comparative accuracy of seven generative artificial intelligence (GenAI) platforms in answering multiple-choice questions from a predoctoral pediatric dentistry examination, and to evaluate the impact of question type and GenAI training on accuracy.
Methods: One hundred multiple-choice questions were answered by seven GenAIs using a standard prompt: five untrained GenAIs (Llama, Gemini, Copilot, ChatGPT3.5, and ChatGPT4) and two trained GenAIs (ChatGPT3.5 and ChatGPT4). Training was performed using evidence-based data. Questions were categorized as knowledge-based or critical-thinking across 10 subspecialty domains. Each GenAI was asked to select one correct answer from four choices, and only the first generated response was recorded. Data were subjected to statistical analysis (alpha equals 0.05), with a passing score of 75 percent.
Results: Trained ChatGPT4 had the highest accuracy (90 percent), while untrained Copilot had the lowest (57 percent). Only three GenAIs achieved a passing score (trained ChatGPT3.5, and both untrained and trained ChatGPT4). The average performance of these three GenAIs (87 percent) was comparable to that of dental students (89 percent). Question type (knowledge-based versus critical-thinking) did not affect GenAI accuracy, nor did subspecialty domain.
Conclusions: Newer or trained GenAI models achieved higher accuracy than older or untrained models. Given this high accuracy, newer or trained GenAI models may serve as adjuncts in predoctoral pediatric dental education.
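The abstract summarizes the scoring protocol but not its implementation. The sketch below shows one way the per-platform tally and the question-type comparison could be carried out. It is a hypothetical reconstruction: only the 90 and 57 percent figures and the 75 percent pass threshold come from the abstract, all other counts are illustrative placeholders, and the chi-square test is an assumed stand-in for whatever analysis the authors actually ran at alpha equals 0.05.

```python
# Minimal sketch of the accuracy tally and a question-type comparison.
# Assumptions are flagged inline; this is not the authors' actual pipeline.
from scipy.stats import chi2_contingency

N_QUESTIONS = 100
PASS_THRESHOLD = 0.75  # passing score of 75 percent, per the study design

# Correct answers out of 100 questions. The 90 and 57 figures are reported in
# the abstract; the remaining five platforms would be filled in from the data.
correct_counts = {
    "trained ChatGPT4": 90,
    "untrained Copilot": 57,
}

for platform, correct in correct_counts.items():
    accuracy = correct / N_QUESTIONS
    verdict = "pass" if accuracy >= PASS_THRESHOLD else "fail"
    print(f"{platform}: {accuracy:.0%} ({verdict})")

# Hypothetical 2x2 contingency table for one platform:
# rows = question type, columns = (correct, incorrect). Counts are invented
# for illustration; they sum to the 100-question exam.
table = [
    [52, 13],  # knowledge-based
    [31, 4],   # critical-thinking
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"question-type effect: p = {p:.3f} (alpha = 0.05)")
```

A nonsignificant p-value here would mirror the reported finding that knowledge-based and critical-thinking questions were answered with comparable accuracy.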