Büker Mine, Mercan Gamze
Department of Endodontics, Faculty of Dentistry, Mersin University, Mersin, Turkey.
Int J Med Inform. 2025 Sep;201:105948. doi: 10.1016/j.ijmedinf.2025.105948. Epub 2025 Apr 25.
This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.
Three LLM chatbots (ChatGPT-3.5, Microsoft Copilot, and Gemini, Version 2.0 Flash) were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices, the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog (GFOG), Linsear Write (LW), Coleman-Liau (CL), and Automated Readability Index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses of continuous variables used one-way ANOVA followed by Tukey or Games-Howell post-hoc tests; Spearman's correlation test was used to assess associations between categorical variables.
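As an illustration only (not the authors' code), the seven readability indices and the statistical comparisons described above could be reproduced roughly as follows with the open-source textstat and scipy packages; the example scores and variable names are assumptions made for the sketch.

import textstat
from scipy import stats

def readability_profile(text):
    # Compute the seven readability indices used in the study.
    return {
        "FRES": textstat.flesch_reading_ease(text),
        "FKGL": textstat.flesch_kincaid_grade(text),
        "SMOG": textstat.smog_index(text),
        "GFOG": textstat.gunning_fog(text),
        "LW": textstat.linsear_write_formula(text),
        "CL": textstat.coleman_liau_index(text),
        "ARI": textstat.automated_readability_index(text),
    }

# Hypothetical FKGL scores for the 10 FAQ responses from each chatbot.
chatgpt = [11.2, 10.8, 12.1, 11.5, 10.9, 11.8, 12.0, 11.1, 10.7, 11.4]
copilot = [11.0, 11.6, 12.3, 10.9, 11.2, 11.9, 11.7, 11.3, 10.8, 11.5]
gemini = [9.1, 9.8, 10.2, 9.5, 9.9, 10.0, 9.4, 9.7, 9.2, 9.6]

# One-way ANOVA across the three chatbots, then Tukey HSD post-hoc tests
# (Games-Howell, used when group variances are unequal, is available in
# third-party packages such as pingouin).
f_stat, p_value = stats.f_oneway(chatgpt, copilot, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print(stats.tukey_hsd(chatgpt, copilot, gemini))

# Spearman's correlation between two ordinal ratings (e.g. accuracy and
# quality scores); the rating vectors here are placeholders.
accuracy = [5, 4, 5, 3, 4, 5, 4, 5, 4, 5]
quality = [5, 4, 4, 3, 4, 5, 4, 5, 3, 5]
rho, p_rho = stats.spearmanr(accuracy, quality)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.4f}")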
All chatbots generated responses exceeding the recommended readability level; the texts were suitable only for readers at or above the 10th-grade level. No significant difference in readability was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini showed the highest proportion of accurate (80%) and high-quality (80%) responses compared with ChatGPT-3.5 and Microsoft Copilot.
None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.