Nuran Özyemişci, Bilge Turhan Bal, Merve Bankoğlu Güngör, Esra Kaynak Öztürk, Ayşegül Canvar, Seçil Karakoca Nemli
Associate Professor, Dental Prosthesis Technology, Vocational School of Health Services, Hacettepe University, Ankara, Turkey.
Professor, Department of Prosthodontics, Faculty of Dentistry, Gazi University, Ankara, Turkey.
J Prosthet Dent. 2025 Sep 8. doi: 10.1016/j.prosdent.2025.08.028.
Despite advances in artificial intelligence (AI), the quality, reliability, and understandability of health-related information provided by chatbots remain uncertain. Furthermore, studies on the maxillofacial prosthesis (MP) information provided by AI chatbots are lacking.
The purpose of this study was to assess and compare the reliability, quality, readability, and similarity of the responses generated by 4 different chatbots to MP-related questions.
A total of 15 questions were prepared by a maxillofacial prosthodontist, and responses were obtained from 4 different chatbots (ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3). A reliability score (adapted DISCERN instrument), the Global Quality Scale (GQS), the Flesch Reading Ease Score (FRES), the Flesch-Kincaid Reading Grade Level (FKRGL), and a similarity index (iThenticate) were used to evaluate the performance of the chatbots. Data were compared by using the Kruskal-Wallis test, and differences between chatbots were determined with the Conover multiple comparison test with Benjamini-Hochberg correction (α=.05).
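For reference, the 2 readability metrics named above are computed from standard counts of words, sentences, and syllables; these are the published Flesch formulas, not study-specific values:

\[ \mathrm{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]

\[ \mathrm{FKRGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 \]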
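The statistical comparison described above can be sketched in a few lines of Python. This is a minimal illustration using SciPy and the scikit-posthocs library with hypothetical example scores (the study's data are not reproduced here); it is not the authors' analysis script.

import scipy.stats as stats
import scikit_posthocs as sp

# Hypothetical GQS scores for the 4 chatbots (illustrative values only)
scores = {
    "ChatGPT-3.5":      [4, 5, 4, 4, 5, 4, 5, 4],
    "Gemini 2.5 Flash": [3, 4, 4, 3, 4, 3, 4, 4],
    "Copilot":          [4, 4, 3, 4, 4, 4, 3, 4],
    "DeepSeek V3":      [5, 4, 4, 5, 4, 4, 5, 4],
}
groups = list(scores.values())

# Omnibus Kruskal-Wallis test across the 4 chatbots
h, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {h:.3f}, P = {p:.3f}")

# Pairwise Conover post hoc test with Benjamini-Hochberg correction,
# mirroring the protocol above (alpha = .05)
if p < 0.05:
    pairwise = sp.posthoc_conover(groups, p_adjust="fdr_bh")
    print(pairwise)  # matrix of BH-adjusted pairwise P values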
The DISCERN scores of the chatbots did not differ significantly, except for 1 question for which ChatGPT showed significantly higher reliability than Gemini and Copilot (P=.03). No statistically significant differences were found among the chatbots in GQS (P=.096), FRES (P=.166), or FKRGL (P=.247) values. The similarity rate of Gemini was significantly higher than that of the other chatbots (P=.03).
ChatGPT-3.5, Gemini 2.5 Flash, Copilot, and DeepSeek V3 generated responses of good quality. However, the responses of all chatbots were difficult for nonprofessionals to read and understand. Similarity rates were low for all chatbots except Gemini, indicating that the information they provided was largely original.