Readability, accuracy, appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment.

Author information

Büker Mine, Mercan Gamze

Affiliation

Department of Endodontics, Faculty of Dentistry, Mersin University, Mersin, Turkey.

Publication information

Int J Med Inform. 2025 Sep;201:105948. doi: 10.1016/j.ijmedinf.2025.105948. Epub 2025 Apr 25.

Abstract

AIM

This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment.

METHODS

Three LLM chatbots, ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices: the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog (GFOG), Linsear Write (LW), Coleman-Liau (CL), and Automated Readability Index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses of continuous variables were conducted using one-way ANOVA with Tukey or Games-Howell post-hoc tests. Spearman's correlation test was used to assess associations between categorical variables.
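
The sketch below illustrates how measurements of this kind could be reproduced in Python; it is not the authors' pipeline. It assumes the open-source textstat and scipy packages, and the example response text, readability scores, and rating arrays are hypothetical placeholders.

import textstat
from scipy import stats

# Hypothetical chatbot answer to one of the 10 patient FAQs.
response = (
    "Root canal retreatment removes the existing filling material from the tooth. "
    "The canals are then cleaned, disinfected, and shaped again. "
    "Finally, the canals are refilled and sealed to resolve the persistent infection."
)

# The seven readability indices named in the Methods.
readability = {
    "FRES": textstat.flesch_reading_ease(response),
    "FKGL": textstat.flesch_kincaid_grade(response),
    "SMOG": textstat.smog_index(response),
    "GFOG": textstat.gunning_fog(response),
    "LW":   textstat.linsear_write_formula(response),
    "CL":   textstat.coleman_liau_index(response),
    "ARI":  textstat.automated_readability_index(response),
}
print(readability)

# Hypothetical per-question FKGL scores for each chatbot (10 FAQs each).
chatgpt = [11.2, 10.8, 12.1, 11.5, 10.9, 11.7, 12.0, 11.3, 10.6, 11.9]
copilot = [11.0, 11.4, 10.7, 12.2, 11.8, 11.1, 10.9, 11.6, 12.0, 11.2]
gemini  = [9.8, 10.1, 9.5, 10.4, 9.9, 10.2, 9.7, 10.0, 9.6, 10.3]

# One-way ANOVA across the three models; Tukey (scipy.stats.tukey_hsd) or
# Games-Howell post-hoc tests would follow a significant omnibus result.
f_stat, p_val = stats.f_oneway(chatgpt, copilot, gemini)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Spearman correlation between ordinal ratings, e.g. accuracy vs. GQS quality
# scores for the same responses (hypothetical values).
accuracy = [5, 4, 5, 3, 4, 5, 4, 5, 4, 5]
quality  = [5, 4, 4, 3, 4, 5, 4, 5, 3, 5]
rho, p_rho = stats.spearmanr(accuracy, quality)
print(f"Spearman: rho = {rho:.2f}, p = {p_rho:.4f}")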

RESULTS

All chatbots generated responses exceeding the recommended readability level, with text suitable only for readers at or above the 10th-grade level. No significant difference was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80%) and high-quality (80%) responses compared with ChatGPT-3.5 and Microsoft Copilot.

CONCLUSIONS

None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.
