Synthetic Patient-Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation

Authors

Haider Syed Ali, Prabha Srinivasagam, Gomez-Cabello Cesar Abraham, Borna Sahar, Genovese Ariana, Trabilsy Maissa, Collaco Bernardo G, Wood Nadia G, Bagaria Sanjay, Tao Cui, Forte Antonio Jorge

Affiliations

Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA.

Department of Radiology AI IT, Mayo Clinic, Rochester, MN 55905, USA.

Publication

Sensors (Basel). 2025 Jul 10;25(14):4305. doi: 10.3390/s25144305.

Abstract

BACKGROUND

Data accessibility remains a significant barrier in healthcare AI due to privacy constraints and logistical challenges. Synthetic data, which mimics real patient information while remaining both realistic and non-identifiable, offers a promising solution. Large Language Models (LLMs) create new opportunities to generate high-fidelity clinical conversations between patients and physicians. However, the value of this synthetic data depends on careful evaluation of its realism, accuracy, and practical relevance.

OBJECTIVE

To assess the performance of four leading LLMs (ChatGPT 4.5, ChatGPT 4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro) in generating synthetic transcripts of patient-physician interactions in plastic surgery scenarios.

METHODS

Each model generated transcripts for ten plastic surgery scenarios. Transcripts were independently evaluated by three clinically trained raters using a seven-criterion rubric: Medical Accuracy, Realism, Persona Consistency, Fidelity, Empathy, Relevancy, and Usability. Raters were blinded to model identity to reduce bias. Each criterion was rated on a 5-point Likert scale, yielding 840 total evaluations (4 models × 10 scenarios × 3 raters × 7 criteria). Descriptive statistics were computed, and a two-way repeated measures ANOVA was used to test for differences across models and metrics. In addition, transcripts were analyzed using automated linguistic and content-based metrics.
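The evaluation design above can be sketched in Python. This is an illustrative reconstruction, not the study's analysis code; the rating values below are hypothetical placeholders, only the grid dimensions come from the abstract:

```python
from statistics import mean, stdev

# Study design as described: 4 models x 10 scenarios x 3 raters x 7 criteria
MODELS = ["ChatGPT 4.5", "ChatGPT 4o", "Claude 3.7 Sonnet", "Gemini 2.5 Pro"]
SCENARIOS = 10  # plastic surgery scenarios per model
RATERS = 3      # blinded, clinically trained raters
CRITERIA = ["Medical Accuracy", "Realism", "Persona Consistency",
            "Fidelity", "Empathy", "Relevancy", "Usability"]

# Total 5-point Likert ratings collected across the whole study
total_evaluations = len(MODELS) * SCENARIOS * RATERS * len(CRITERIA)
print(total_evaluations)  # → 840

# Descriptive statistics for one (model, criterion) cell, pooled over the
# 10 scenarios x 3 raters = 30 ratings. Placeholder values, not real data.
ratings = [5, 5, 4, 5, 5, 5] * 5  # 30 hypothetical Likert scores
print(round(mean(ratings), 2), round(stdev(ratings), 2))
```

The two-way repeated measures ANOVA reported in the results would then be run on the per-cell means (e.g. with `statsmodels` or `pingouin`), with model and criterion as within-subject factors.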

RESULTS

All models achieved strong performance, with mean ratings exceeding 4.5 across all criteria. Gemini 2.5 Pro received perfect mean scores (5.00 ± 0.00) in Medical Accuracy, Realism, Persona Consistency, Relevancy, and Usability. Claude 3.7 Sonnet matched these perfect scores in Persona Consistency and Relevancy and led in Empathy (4.96 ± 0.18). ChatGPT 4.5 also achieved perfect scores in Relevancy, with high scores in Empathy (4.93 ± 0.25) and Usability (4.96 ± 0.18). ChatGPT 4o demonstrated consistently strong but slightly lower performance across most dimensions. ANOVA revealed no statistically significant differences across models (F(3, 6) = 0.85, p = 0.52). Automated analysis showed substantial variation in transcript length, style, and content richness: Gemini 2.5 Pro generated the longest and most emotionally expressive dialogues, while ChatGPT 4o produced the shortest and most concise outputs.

CONCLUSIONS

Leading LLMs can generate medically accurate, emotionally appropriate synthetic dialogues suitable for educational and research use. Despite high performance, demographic homogeneity in generated patients highlights the need for improved diversity and bias mitigation in model outputs. These findings support the cautious, context-aware integration of LLM-generated dialogues into medical training, simulation, and research.

