Emile Sameh Hany, Horesh Nir, Garoufalia Zoe, Gefen Rachel, Boutros Marylise, Wexner Steven D
Ellen Leifer Shulman and Steven Shulman Digestive Disease Center, Cleveland Clinic Florida, Weston, FL, USA.
Colorectal Surgery Unit, General Surgery Department, Mansoura University Hospitals, Mansoura, Egypt.
Am Surg. 2026 Jan;92(1):258-269. doi: 10.1177/00031348251367031. Epub 2025 Aug 11.
Background: Chatbots and large language models, particularly ChatGPT, have prompted a growing number of studies on the potential of chatbots in patient education. In this systematic review, we aimed to provide a pooled assessment of the appropriateness and accuracy of chatbot responses in patient education across various medical disciplines.
Methods: This was a PRISMA-compliant systematic review and meta-analysis. PubMed and Scopus were searched from January to August 2023. Eligible studies that assessed the utility of chatbots in patient education were included. Primary outcomes were the appropriateness and quality of chatbot responses. Secondary outcomes included readability and concordance with published guidelines and Google searches. A random-effects proportional meta-analysis was used to pool the data.
Results: Following initial screening, 21 studies were included. The pooled rate of appropriateness of chatbot answers was 89.1% (95% CI: 84.9%-93.3%). ChatGPT was the most frequently assessed chatbot. Responses, while accurate, were written at a college reading level: the weighted mean Flesch-Kincaid Grade Level was 13.1 (95% CI: 11.7-14.5) and the weighted mean Flesch Reading Ease score was 38.6 (95% CI: 29-48.2). Chatbot answers to patient education questions showed 78.6%-95% concordance with published guidelines in colorectal surgery and urology. Chatbots achieved higher patient education scores than Google Search (87% vs 78%).
Conclusions: Chatbots provide largely accurate and appropriate answers for patient education. The advanced reading level of chatbot responses may limit their wide adoption as a source of patient education. However, they outperform traditional search engines and align well with professional guidelines, showcasing their potential in patient education.
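The two readability metrics reported in the Results follow standard published formulas. The sketch below computes both from word, sentence, and syllable counts; the counting step itself (tokenization and syllable estimation) is assumed to happen upstream and is not shown.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher = easier to read.
    Scores of roughly 30-50 (as pooled here, 38.6) indicate college-level text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade needed.
    The pooled value of 13.1 corresponds to early college reading material."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For context, patient education materials are commonly recommended to target roughly a sixth- to eighth-grade reading level, well below the pooled grade level of 13.1 reported in this review.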