
Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry.

Author Information

Ozdemir Zeyneb Merve, Yapici Emre

Affiliation

Department of Restorative Dentistry, Faculty of Dentistry, Kahramanmaras Sutcu Imam University, Kahramanmaras, Turkey.

Publication Information

J Esthet Restor Dent. 2025 Jul;37(7):1740-1752. doi: 10.1111/jerd.13447. Epub 2025 Mar 2.

Abstract

OBJECTIVE

This study aimed to evaluate the accuracy, reliability, consistency, and readability of responses provided by various artificial intelligence (AI) programs to questions related to Restorative Dentistry.

MATERIALS AND METHODS

Forty-five knowledge-based questions and 20 additional questions (10 patient-related and 10 dentistry-specific) were posed to the ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Chatsonic, Copilot, and Gemini Advanced chatbots. The DISCERN questionnaire was used to assess reliability; Flesch Reading Ease and Flesch-Kincaid Grade Level scores were used to evaluate readability. Accuracy and consistency were determined from the chatbots' responses to the knowledge-based questions.
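The two readability metrics used here are standard formulas over word, sentence, and syllable counts. A minimal sketch of how such scores are computed (the function name and the example counts are illustrative, not taken from the study):

```python
def readability_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level).

    FRE: higher = easier to read (0-100 scale in practice).
    FKGL: approximate U.S. school grade needed to understand the text.
    """
    wps = words / sentences    # average words per sentence
    spw = syllables / words    # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

# Hypothetical example: a 100-word chatbot response in 5 sentences with 150 syllables
fre, fkgl = readability_scores(100, 5, 150)
# fre = 59.635 ("fairly difficult"); fkgl = 9.91 (about 10th grade),
# above the 6th-8th grade level typically recommended for patient education materials
```

In practice, tools that compute these scores differ mainly in their syllable-counting heuristics, which is why reported values can vary slightly between programs.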

RESULTS

ChatGPT-4, ChatGPT-4o, Chatsonic, and Copilot demonstrated "good" reliability, while ChatGPT-3.5 and Gemini Advanced showed "fair" reliability. Chatsonic exhibited the highest "DISCERN total score" for patient-related questions, while ChatGPT-4o performed best for dentistry-specific questions. No significant differences were found in readability among the chatbots (p > 0.05). ChatGPT-4o showed the highest accuracy (93.3%) for knowledge-based questions, while Copilot had the lowest (68.9%). ChatGPT-4 demonstrated the highest consistency between repetitions.

CONCLUSION

The performance of the AI chatbots varied in accuracy, reliability, consistency, and readability when responding to Restorative Dentistry questions. ChatGPT-4o and Chatsonic showed promising results for academic and patient education applications. However, the readability of responses was generally above recommended levels for patient education materials.

CLINICAL SIGNIFICANCE

AI is having an increasing impact on many aspects of dentistry. If chatbot responses to patient-related and dentistry-specific questions in restorative dentistry prove reliable and comprehensible, they could become a valuable adjunct to patient education and clinical practice.

