Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study.
Author Information
Bai Xuexue, Wang Shiyong, Zhao Yuanli, Feng Ming, Ma Wenbin, Liu Xiaomin
Affiliations
Department of Neurosurgery, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.
Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China.
Publication Information
J Med Internet Res. 2025 May 21;27:e67462. doi: 10.2196/67462.
BACKGROUND
Telemedicine that incorporates artificial intelligence (AI) tools such as chatbots offers significant potential for enhancing health care delivery. However, how AI chatbots perform relative to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions.
OBJECTIVE
This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making.
METHODS
We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings.
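A minimal sketch of how one chatbot response per consultation record might be generated in an isolated conversation, assuming the OpenAI Python client; the system prompt wording, model identifier, and decoding defaults shown here are illustrative and are not specified in the abstract.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_consultation(patient_message: str) -> str:
    # One call per deidentified record, with no prior turns included,
    # so nothing carries over between sessions (no cross-session interference).
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You are an oncologist replying to an asynchronous "
                        "text-based message from a patient with cancer."},
            {"role": "user", "content": patient_message},
        ],
    )
    return response.choices[0].message.content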
RESULTS
In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as "Moderately trustworthy" or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretation of medical terminology or lack of access to up-to-date clinical guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians.
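As an illustration of the chi-square testing described in the Methods, the "fully complete" counts reported above for the patient education scenario can be arranged in a 2x2 contingency table; this is a sketch only, since the published analysis was based on the full 3-level rating scale and need not yield the same P value.

from scipy.stats import chi2_contingency

# Fully complete vs not fully complete, chatbot and physicians (n=2364 each).
table = [
    [2301, 2364 - 2301],  # chatbot
    [2213, 2364 - 2213],  # physicians
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p_value:.2g}")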
CONCLUSIONS
The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot's potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.