Cook David A, Overgaard Joshua, Pankratz V Shane, Del Fiol Guilherme, Aakre Chris A
Division of General Internal Medicine, Mayo Clinic College of Medicine and Science, Rochester, MN, United States.
Multidisciplinary Simulation Center, Mayo Clinic College of Medicine and Science, Rochester, MN, United States.
J Med Internet Res. 2025 Apr 4;27:e68486. doi: 10.2196/68486.
Virtual patients (VPs) are computer screen-based simulations of patient-clinician encounters. VP use is limited by cost and low scalability.
We aimed to show that VPs powered by large language models (LLMs) can generate authentic dialogues, accurately represent patient preferences, and provide personalized feedback on clinical performance. We also explored using LLMs to rate the quality of dialogues and feedback.
We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI's generative pretrained transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough diagnosis and diabetes management), each with permutations representing different patient preferences, we created 60 conversations (dialogues plus feedback): 48 with a human clinician and 12 "self-chat" dialogues with GPT role-playing both the VP and clinician. Primary outcomes were dialogue authenticity and feedback quality, rated using novel instruments for which we conducted a validation study collecting evidence of content, internal structure (reproducibility), relations with other variables, and response process. Each conversation was rated by 3 physicians and by GPT. Secondary outcomes included user experience, bias, patient preferences represented in the dialogues, and conversation features that influenced authenticity.
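To illustrate the general approach described above (this is not the study's actual engineered prompts), the following minimal Python sketch shows how a system prompt can cast an LLM as a virtual patient with explicit preferences and how the same model can then be asked for post-encounter feedback. It assumes the standard OpenAI Python client; the persona, preference wording, model string, and feedback prompt are hypothetical stand-ins.

```python
# Minimal sketch of an LLM-powered virtual patient (VP) via the OpenAI chat API.
# The persona details and prompts below are illustrative assumptions, not the
# study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System prompt casting the model as a VP with explicit preferences
# (hypothetical example loosely based on the chronic cough topic).
VP_SYSTEM_PROMPT = (
    "You are a 58-year-old patient with a chronic cough. Answer only what the "
    "clinician asks, in one or two short sentences, using plain everyday "
    "language. You are worried the cough could be cancer and you prefer to "
    "avoid starting new medications if possible."
)

dialogue = [{"role": "system", "content": VP_SYSTEM_PROMPT}]

def patient_reply(clinician_utterance: str) -> str:
    """Append the clinician's turn and return the virtual patient's response."""
    dialogue.append({"role": "user", "content": clinician_utterance})
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed stand-in; the study compared GPT-4.0-Turbo and GPT-3.5-Turbo
        messages=dialogue,
        temperature=0.7,
    )
    reply = response.choices[0].message.content
    dialogue.append({"role": "assistant", "content": reply})
    return reply

def feedback_on_encounter() -> str:
    """After the encounter, ask the model to critique the clinician's performance."""
    transcript = "\n".join(
        f"{m['role']}: {m['content']}" for m in dialogue if m["role"] != "system"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an experienced clinical educator. Give brief, specific "
                    "feedback on the clinician's communication and management, "
                    "noting both strengths and weaknesses."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

print(patient_reply("What brings you in today?"))
print(feedback_on_encounter())
```

A "self-chat" dialogue, as described in the methods, would simply replace the human clinician's turns with output from a second model instance prompted to play the clinician.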
The average cost per conversation was US $0.51 for GPT-4.0-Turbo and US $0.02 for GPT-3.5-Turbo. Mean (SD) conversation ratings, maximum 6, were overall dialogue authenticity 4.7 (0.7), overall user experience 4.9 (0.7), and average feedback quality 4.7 (0.6). For dialogues created using GPT-4.0-Turbo, physician ratings of patient preferences aligned with intended preferences in 20 to 47 of 48 dialogues (42%-98%). Subgroup comparisons revealed higher ratings for dialogues using GPT-4.0-Turbo versus GPT-3.5-Turbo and for human-generated versus self-chat dialogues. Feedback ratings were similar for human-generated versus GPT-generated ratings, whereas authenticity ratings were lower. We did not perceive bias in any conversation. Dialogue features that detracted from authenticity included that GPT was verbose or used atypical vocabulary (93/180, 51.7% of conversations), was overly agreeable (n=56, 31%), repeated the question as part of the response (n=47, 26%), was easily convinced by clinician suggestions (n=35, 19%), or was not disaffected by poor clinician performance (n=32, 18%). For feedback, detractors included excessively positive feedback (n=42, 23%), failure to mention important weaknesses or strengths (n=41, 23%), or factual inaccuracies (n=39, 22%). Regarding validation of dialogue and feedback scores, items were meticulously developed (content evidence), and we confirmed expected relations with other variables (higher ratings for advanced LLMs and human-generated dialogues). Reproducibility was suboptimal, due largely to variation in LLM performance rather than rater idiosyncrasies.
LLM-powered VPs can simulate patient-clinician dialogues, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings.