Jalali Poorya, Mohammad-Rahimi Hossein, Wang Feng-Ming, Sohrabniya Fatemeh, Ourang Seyed AmirHossein, Tian Yuke, Martinho Frederico C, Nosrat Ali
Department of Endodontics, Texas A&M College of Dentistry, Dallas, Texas.
Department of Dentistry and Oral Health, Aarhus University, Aarhus, Denmark; Conservative Dentistry and Periodontology, LMU Klinikum, LMU, Munich, Germany.
J Endod. 2025 Jun 26. doi: 10.1016/j.joen.2025.06.014.
The aim of this study was to assess the overall performance of artificial intelligence (AI) chatbots in answering board-style endodontic questions.
One hundred multiple-choice endodontic questions, following the style of the American Board of Endodontics Written Exam, were generated by two board-certified endodontists. The questions were submitted three times in a row to each of the following chatbots: Gemini Advanced, Gemini, Microsoft Copilot, GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet. Each chatbot was asked to choose the correct answer and to explain its justification. A response was considered "correct" only if the chatbot picked the right choice in all three attempts. The quality of the reasoning behind each selected answer was scored on a three-point ordinal scale (0, 1, 2). Two calibrated reviewers scored all 2,100 responses independently. Categorical data were analyzed with the chi-square test; ordinal data were analyzed with the Kruskal-Wallis and Mann-Whitney tests.
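As a minimal illustration only (not part of the published methods), the all-or-nothing accuracy rule and the statistical tests named above could be sketched in Python with SciPy as follows; the data values and the helper `is_correct` are hypothetical placeholders, not the study's data or analysis code.

```python
# Hypothetical sketch of the scoring rule and statistical comparisons described above.
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu

def is_correct(attempts):
    """A question counts as 'correct' only if all three attempts picked the right choice."""
    return all(attempts)

# Hypothetical 2x2 contingency table of correct/incorrect counts for two chatbots.
table = [[71, 29],   # chatbot A: 71 correct, 29 incorrect (illustrative numbers)
         [48, 52]]   # chatbot B: 48 correct, 52 incorrect (illustrative numbers)
chi2, p_cat, dof, _ = chi2_contingency(table)  # chi-square test for categorical data

# Hypothetical ordinal reasoning scores (0, 1, 2) for three chatbots.
scores_a = [2, 2, 1, 0, 2, 2, 1, 2]
scores_b = [1, 0, 2, 0, 1, 1, 0, 2]
scores_c = [2, 1, 2, 2, 0, 2, 2, 1]

h_stat, p_kw = kruskal(scores_a, scores_b, scores_c)  # omnibus comparison of ordinal data
u_stat, p_mw = mannwhitneyu(scores_a, scores_b)       # pairwise follow-up comparison

print(f"chi-square P={p_cat:.3f}, Kruskal-Wallis P={p_kw:.3f}, Mann-Whitney P={p_mw:.3f}")
```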
Accuracy ranged from 48% (Microsoft Copilot) to 71% (Gemini Advanced, GPT-3.5, and Claude 3.5 Sonnet) (P < .05). Gemini Advanced, Gemini, and Microsoft Copilot performed similarly regardless of the question source (textbook or literature) (P > .05). GPT-3.5, GPT-4o, GPT-4.0, and Claude 3.5 Sonnet performed significantly better on textbook-based questions (P < .05). The distribution of reasoning scores differed among the chatbots (P < .05). Gemini Advanced had the highest rate of score 2 (81%) and the lowest rate of score 0 (18.5%).
This comprehensive assessment of seven AI chatbots' performance on board-style endodontic questions revealed their capabilities and limitations as educational resources in the field of endodontics.