Chau Reinhard Chun Wang, Thu Khaing Myat, Yu Ollie Yiru, Hsung Richard Tai-Chiu, Wang Denny Chon Pei, Man Manuel Wing Ho, Wang John Junwen, Lam Walter Yu Hang
Faculty of Dentistry, The University of Hong Kong, Hong Kong 999077, China.
Department of Computer Science, Hong Kong Chu Hai College, Hong Kong 999077, China.
Dent J (Basel). 2025 Jun 21;13(7):279. doi: 10.3390/dj13070279.
Objectives: This study aims to evaluate the response accuracy and quality of three AI chatbots (GPT-4.0, Claude-2, and Llama-2) in answering multiple-choice questions in prosthodontic and restorative dentistry. Methods: A total of 191 text-based multiple-choice questions were selected from the prosthodontic and restorative dentistry sections of the United States Integrated National Board Dental Examination (INBDE) (n = 80) and the United Kingdom Overseas Registration Examination (ORE) (n = 111). These questions were input into the chatbots, and the AI-generated answers were compared with the official answer keys to determine their accuracy. Additionally, two dental specialists independently evaluated the rationale accompanying each chatbot response for accuracy, relevance, and comprehensiveness, categorizing it into one of four ratings. Chi-square tests and post hoc Z-tests with Bonferroni adjustment were used to analyze the responses. Inter-rater reliability for the rationale-quality ratings was assessed using Cohen's kappa (κ). Results: GPT-4.0 (65.4%; n = 125/191) answered a significantly higher proportion of multiple-choice questions correctly than Claude-2 (41.9%; n = 80/191) (p < 0.017) and Llama-2 (26.2%; n = 50/191) (p < 0.017). Significant differences in answer accuracy were observed among all chatbots (p < 0.001). In terms of rationale quality, GPT-4.0 (58.1%; n = 111/191) had a significantly higher proportion of "Correct Answer, Correct Rationale" responses than Claude-2 (37.2%; n = 71/191) (p < 0.017) and Llama-2 (24.1%; n = 46/191) (p < 0.017). Significant differences in rationale quality were observed among all chatbots (p < 0.001). Inter-rater reliability was very high (κ = 0.83). Conclusions: GPT-4.0 demonstrated the highest accuracy and quality of reasoning in answering prosthodontic and restorative dentistry questions. This underscores the varying efficacy of AI chatbots within specialized dental contexts.
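For illustration, the overall chi-square test, the post hoc pairwise comparisons, and the Bonferroni-adjusted significance threshold (0.05 / 3 ≈ 0.017) reported above can be reproduced from the stated accuracy counts with a short Python sketch. The counts are taken from the abstract; the script itself is an illustrative assumption, not the authors' analysis code.

```python
# Minimal sketch (assumed, not the authors' code) of the reported analysis,
# using the accuracy counts from the abstract:
# GPT-4.0 125/191, Claude-2 80/191, Llama-2 50/191.
from itertools import combinations
from math import sqrt

from scipy.stats import chi2_contingency, norm

N = 191
correct = {"GPT-4.0": 125, "Claude-2": 80, "Llama-2": 50}

# Overall chi-square test on the 3x2 table of correct/incorrect counts.
table = [[c, N - c] for c in correct.values()]
chi2, p_overall, dof, _ = chi2_contingency(table)
print(f"overall chi-square: chi2={chi2:.2f}, df={dof}, p={p_overall:.4g}")

# Post hoc pairwise two-proportion z-tests, judged against a
# Bonferroni-adjusted threshold of 0.05 / 3 comparisons ~= 0.017.
alpha_adj = 0.05 / 3
for (a, ca), (b, cb) in combinations(correct.items(), 2):
    p_pool = (ca + cb) / (2 * N)                 # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (2 / N))   # standard error of the difference
    z = (ca / N - cb / N) / se
    p_val = 2 * norm.sf(abs(z))                  # two-sided p-value
    print(f"{a} vs {b}: z={z:.2f}, p={p_val:.4g}, "
          f"significant at {alpha_adj:.3f}: {p_val < alpha_adj}")
```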