Kaur Amarpreet, Budko Alexander, Liu Katrina, Eaton Eric, Steitz Bryan D, Johnson Kevin B
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania.
School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania.
Appl Clin Inform. 2025 May;16(3):718-731. doi: 10.1055/a-2565-9155. Epub 2025 Mar 25.
Patient portals bridge patient and provider communication but exacerbate physician and nurse burnout. Large language models (LLMs) can generate message responses that are viewed favorably by health care professionals/providers (HCPs); however, prior studies have not included diverse message types or newer prompt-engineering strategies.

Our goal was to investigate and compare the quality and precision of GPT-generated message responses versus real doctor responses across the spectrum of message types within a patient portal.

We used prompt-engineering techniques to craft synthetic provider responses tailored to adult primary care patients. We enrolled a sample of primary care providers in a cross-sectional study to compare authentic patient portal message responses with synthetic responses generated by GPT-3.5-turbo, July 2023 version (GPT). The survey assessed each response's empathy, relevance, medical accuracy, and readability on a scale from 0 to 5. Respondents were also asked to identify which responses were GPT-generated and which were provider-generated. Mean scores for all metrics were computed for subsequent analysis.

A total of 49 HCPs participated in the survey (59% completion rate), comprising 16 physicians and 32 advanced practice providers (APPs). Compared with responses generated by real doctors, GPT-generated responses scored statistically significantly higher on two of the four parameters: empathy (p < 0.05) and readability (p < 0.05). No statistically significant difference was observed for relevance or accuracy (p > 0.05). Although readability scores differed significantly, the absolute difference was small, and the clinical significance of this finding remains uncertain.

Our findings affirm the potential of GPT-generated message responses to achieve levels of empathy, relevance, and readability comparable to those of typical responses crafted by HCPs. Additional studies should be conducted within provider workflows, with careful evaluation of patient attitudes and concerns about both the ethics and the quality of generated responses in all settings.
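The abstract does not report the authors' prompts, model snapshot, or generation parameters. As a rough illustration of the kind of prompt-engineered call described in the Methods, the sketch below drafts a reply to a hypothetical patient portal message using GPT-3.5-turbo via the OpenAI Python SDK; the system prompt, example message, and temperature are assumptions for demonstration only, not the study's actual configuration.

```python
# Illustrative sketch only: the study's actual prompts, model snapshot, and
# parameters are not reported in this abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical system prompt approximating the "synthetic provider response"
# setup described in the Methods (prompt engineering for adult primary care).
SYSTEM_PROMPT = (
    "You are a primary care provider answering a patient portal message from "
    "an adult patient. Reply with empathy, stay medically accurate, keep the "
    "response relevant to the question, and write at an accessible reading level."
)

# Hypothetical patient message, used here purely for demonstration.
patient_message = (
    "I've had a mild cough for five days and no fever. "
    "Should I come in, or is it safe to wait?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the study used the July 2023 version of GPT-3.5-turbo
    temperature=0.7,        # assumed value; not reported in the abstract
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": patient_message},
    ],
)

print(response.choices[0].message.content)  # synthetic provider reply
```

In the study design, a response like this would then be rated from 0 to 5 on empathy, relevance, medical accuracy, and readability and compared against an authentic provider reply; those survey and statistical comparison steps are not shown here.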