Emergency Medicine Department, National University Hospital, National University Health System, Singapore, Singapore.
Department of Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
J Med Internet Res. 2024 Aug 9;26:e56413. doi: 10.2196/56413.
Patient complaints are a perennial challenge faced by health care institutions globally, requiring extensive time and effort from health care workers. Despite these efforts, patient dissatisfaction remains high. Recent studies of large language models (LLMs) in the health care sector, such as the GPT models developed by OpenAI, have shown great promise, with the models able to provide more detailed and empathetic responses than physicians. LLMs could therefore be used to respond to patient complaints, improving patient satisfaction and shortening complaint response times.
This study aims to evaluate the performance of LLMs in addressing patient complaints received by a tertiary health care institution, with the goal of enhancing patient satisfaction.
Anonymized patient complaint emails and associated responses from the patient relations department were obtained. ChatGPT-4.0 (OpenAI, Inc) was provided with the same complaint email and tasked to generate a response. The complaints and the respective responses were uploaded onto a web-based questionnaire. Respondents were asked to rate both responses on a 10-point Likert scale for 4 items: appropriateness, completeness, empathy, and satisfaction. Participants were also asked to choose a preferred response at the end of each scenario.
There were 188 respondents in total, of whom 115 (61.2%) were health care workers. A majority of respondents, among both health care and non-health care workers, preferred the ChatGPT replies (n=164, 87.2% to n=183, 97.3%). GPT-4.0 responses were rated higher on all 4 assessed items, with median scores of 8 (IQR 7-9) for every item, compared with human responses (appropriateness 5, IQR 3-7; empathy 4, IQR 3-6; quality 5, IQR 3-6; satisfaction 5, IQR 3-6; P<.001), and had a higher average word count (238 vs 76 words). Regression analyses showed that a higher word count was a statistically significant predictor of higher scores on all 4 items, with every 1-word increment yielding an increase of between 0.015 and 0.019 points (all P<.001). However, on subgroup analysis by authorship, this held true only for responses written by patient relations department staff and not for those generated by ChatGPT, which received consistently high scores irrespective of response length.
This study provides significant evidence supporting the effectiveness of LLMs in the resolution of patient complaints. ChatGPT demonstrated superiority in response appropriateness, empathy, quality, and overall satisfaction when compared with actual human responses to patient complaints. Future research could quantify the improvements that artificial intelligence-generated responses can bring in terms of time savings, cost-effectiveness, patient satisfaction, and stress reduction for the health care system.