Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry.

Author Information

Chau Reinhard Chun Wang, Thu Khaing Myat, Yu Ollie Yiru, Hsung Richard Tai-Chiu, Wang Denny Chon Pei, Man Manuel Wing Ho, Wang John Junwen, Lam Walter Yu Hang

Affiliations

Faculty of Dentistry, The University of Hong Kong, Hong Kong 999077, China.

Department of Computer Science, Hong Kong Chu Hai College, Hong Kong 999077, China.

Publication Information

Dent J (Basel). 2025 Jun 21;13(7):279. doi: 10.3390/dj13070279.


DOI: 10.3390/dj13070279
PMID: 40710124
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12293279/
Abstract

Objectives: This study aims to evaluate the response accuracy and quality of three AI chatbots (GPT-4.0, Claude-2, and Llama-2) in answering multiple-choice questions in prosthodontic and restorative dentistry. Methods: A total of 191 text-based multiple-choice questions were selected from the prosthodontic and restorative dentistry sections of the United States Integrated National Board Dental Examination (INBDE) (n = 80) and the United Kingdom Overseas Registration Examination (ORE) (n = 111). These questions were input into the chatbots, and the AI-generated answers were compared with the official answer keys to determine their accuracy. Additionally, two dental specialists independently evaluated the rationale accompanying each chatbot response for accuracy, relevance, and comprehensiveness, categorizing the responses into four distinct ratings. Chi-square and post hoc Z-tests with Bonferroni adjustment were used to analyze the responses. Inter-rater reliability between the specialists' rationale quality ratings was assessed using Cohen's kappa (κ). Results: GPT-4.0 (65.4%; n = 125/191) answered a significantly higher proportion of the multiple-choice questions correctly than Claude-2 (41.9%; n = 80/191) (p < 0.017) and Llama-2 (26.2%; n = 50/191) (p < 0.017). Significant differences in answer accuracy were observed among all of the chatbots (p < 0.001). In terms of rationale quality, GPT-4.0 (58.1%; n = 111/191) had a significantly higher proportion of "Correct Answer, Correct Rationale" responses than Claude-2 (37.2%; n = 71/191) (p < 0.017) and Llama-2 (24.1%; n = 46/191) (p < 0.017). Significant differences in rationale quality were observed among all of the chatbots (p < 0.001). Inter-rater reliability was very high (κ = 0.83). Conclusions: GPT-4.0 demonstrated the highest accuracy and quality of reasoning in responding to prosthodontic and restorative dentistry questions. This underscores the varying efficacy of AI chatbots within specialized dental contexts.
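The pairwise criterion p < 0.017 above is the Bonferroni-adjusted significance level for three pairwise comparisons (0.05 / 3 ≈ 0.017). The following is a minimal Python sketch of that analysis, reconstructed from the counts reported in the abstract; it is illustrative only, not the authors' script, and runs the overall chi-square test plus the post hoc two-proportion z-tests:

```python
# Illustrative reconstruction of the abstract's analysis from reported counts:
# an overall chi-square test across the three chatbots, then pairwise
# two-proportion z-tests judged against the Bonferroni threshold 0.05/3.
from itertools import combinations
from math import sqrt

from scipy.stats import chi2_contingency, norm

TOTAL = 191
correct = {"GPT-4.0": 125, "Claude-2": 80, "Llama-2": 50}  # answer accuracy

# Overall test: 3x2 contingency table of correct vs. incorrect counts.
table = [[c, TOTAL - c] for c in correct.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square(dof={dof}) = {chi2:.1f}, p = {p:.2e}")  # p < 0.001

# Post hoc pairwise two-proportion z-tests with Bonferroni adjustment.
alpha = 0.05 / 3  # three pairwise comparisons -> threshold of ~0.017
for (name_a, a), (name_b, b) in combinations(correct.items(), 2):
    p_pool = (a + b) / (2 * TOTAL)                  # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (2 / TOTAL))  # pooled standard error
    z = (a / TOTAL - b / TOTAL) / se
    p_val = 2 * norm.sf(abs(z))                     # two-sided p-value
    verdict = "significant" if p_val < alpha else "n.s."
    print(f"{name_a} vs {name_b}: z = {z:.2f}, p = {p_val:.2e} ({verdict})")
```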

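Cohen's kappa, used above to check agreement between the two specialists, is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal frequencies. The sketch below illustrates the computation on hypothetical ratings over four categories (the study's actual rating data are not reproduced here):

```python
# Minimal sketch of Cohen's kappa for two raters; the ratings below are
# hypothetical placeholders, not the study's data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(marg_a[c] / n * marg_b[c] / n for c in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings on a four-category scale (1-4), for illustration only.
a = [1, 1, 2, 4, 3, 1, 2, 2, 1, 4]
b = [1, 1, 2, 4, 3, 1, 2, 3, 1, 4]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # ~0.86, i.e. near-perfect agreement
```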

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6df4/12293279/4bcaa152a893/dentistry-13-00279-g001.jpg

Similar Articles

[1]
Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry.

Dent J (Basel). 2025-6-21

[2]
Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions.

J Endod. 2025-6-26

[3]
Performance of Multimodal Artificial Intelligence Chatbots Evaluated on Clinical Oncology Cases.

JAMA Netw Open. 2024-10-1

[4]
Accuracy and Reliability of Artificial Intelligence Chatbots as Public Information Sources in Implant Dentistry.

Int J Oral Maxillofac Implants. 2025-6-25

[5]
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

J Med Internet Res. 2025-5-20

[6]
Comparison of artificial intelligence systems in answering prosthodontics questions from the dental specialty exam in Turkey.

J Dent Sci. 2025-7

[7]
Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.

Front Digit Health. 2025-6-27

[8]
Performance of AI Chatbots in Preliminary Diagnosis of Maxillofacial Pathologies.

Med Sci Monit. 2025-7-9

[9]
Large language models for the screening step in systematic reviews in dentistry.

J Dent. 2025-9

[10]
Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot.

Anat Sci Educ. 2025-5-11

Cited By

[1]
Brush, byte, and bot: quality comparison of artificial intelligence-generated pediatric dental advice across ChatGPT, Gemini, and Copilot.

Front Oral Health. 2025-8-15

[2]
Optimizing Preclinical Skill Assessment for Handpiece-Naïve Students: A Strategic Approach.

Dent J (Basel). 2025-8-11

References

[1]
Evaluating the validity and consistency of artificial intelligence chatbots in responding to patients' frequently asked questions in prosthodontics.

J Prosthet Dent. 2025-4-7

[2]
Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry.

J Esthet Restor Dent. 2025-7

[3]
External Validation of an AI mHealth Tool for Gingivitis Detection among Older Adults at Daycare Centers: A Pilot Study.

Int Dent J. 2025-6

[4]
The virtual assessment in dental education: A narrative review.

J Dent Sci. 2024-12

[5]
Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments.

Clin Oral Investig. 2024-10-7

[6]
Self-monitoring of Oral Health Using Smartphone Selfie Powered by Artificial Intelligence: Implications for Preventive Dentistry.

Oral Health Prev Dent. 2024-9-23

[7]
The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis.

J Med Internet Res. 2024-9-10

[8]
Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study.

J Dent. 2024-5

[9]
Performance of Generative Artificial Intelligence in Dental Licensing Examinations.

Int Dent J. 2024-6

[10]
Artificial intelligence in dental education: ChatGPT's performance on the periodontic in-service examination.

J Periodontol. 2024-7
