Omar Mahmud, Soffer Shelly, Agbareia Reem, Bragazzi Nicola Luigi, Apakama Donald U, Horowitz Carol R, Charney Alexander W, Freeman Robert, Kummer Benjamin, Glicksberg Benjamin S, Nadkarni Girish N, Klang Eyal
The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
Nat Med. 2025 Apr 7. doi: 10.1038/s41591-025-03626-6.
Large language models (LLMs) show promise in healthcare, but concerns remain that they may produce medically unjustified clinical care recommendations reflecting the influence of patients' sociodemographic characteristics. We evaluated nine LLMs, analyzing over 1.7 million model-generated outputs from 1,000 emergency department cases (500 real and 500 synthetic). Each case was presented in 32 variations (31 sociodemographic groups plus a control) while holding clinical details constant. Compared to both a physician-derived baseline and each model's own control case without sociodemographic identifiers, cases labeled as Black or unhoused or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions or mental health evaluations. For example, certain cases labeled as being from LGBTQIA+ subgroups were recommended mental health assessments approximately six to seven times more often than clinically indicated. Similarly, cases labeled as having high-income status received significantly more recommendations (P < 0.001) for advanced imaging tests such as computed tomography and magnetic resonance imaging, while low- and middle-income-labeled cases were often limited to basic or no further testing. After applying multiple-hypothesis corrections, these key differences persisted. Their magnitude was not supported by clinical reasoning or guidelines, suggesting that they may reflect model-driven bias, which could eventually lead to health disparities rather than acceptable clinical variation. Our findings, observed in both proprietary and open-source models, underscore the need for robust bias evaluation and mitigation strategies to ensure that LLM-driven medical advice remains equitable and patient centered.
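The abstract describes a counterfactual-prompting design: each of the 1,000 cases is presented once as a control and once for each of 31 sociodemographic labels, with clinical details held constant, and label-versus-control differences in recommendations are then tested with multiple-hypothesis correction. The sketch below is not the authors' code; it is a minimal illustration of that design under stated assumptions. The group names, the prompt template, the `query_llm` stub, and the `recommends_advanced_imaging` parser are all hypothetical placeholders; the statistical screen uses Fisher's exact test with Benjamini-Hochberg FDR correction as one plausible choice, not necessarily the method used in the paper.

```python
"""Illustrative sketch (assumptions labeled) of a label-vs-control bias audit
for LLM-generated clinical recommendations, as described in the abstract."""
from collections import Counter

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical subset of labels; the study used 31 sociodemographic groups plus a control.
GROUPS = ["control", "Black", "unhoused", "LGBTQIA+", "high income", "low income"]


def make_variants(case_text: str) -> dict[str, str]:
    """Hold clinical details constant; vary only the sociodemographic label."""
    variants = {"control": case_text}
    for g in GROUPS[1:]:
        # Hypothetical prompt template for illustration only.
        variants[g] = f"The patient is {g}. {case_text}"
    return variants


def query_llm(prompt: str) -> str:
    """Placeholder for a call to one of the nine evaluated LLMs."""
    raise NotImplementedError("plug in an actual model API here")


def recommends_advanced_imaging(output: str) -> bool:
    """Toy parser: flag CT/MRI mentions in a model recommendation."""
    return any(k in output.lower() for k in ("ct", "computed tomography", "mri"))


def compare_to_control(counts: dict[str, Counter], n_runs: int) -> dict:
    """Fisher's exact test of each labeled group vs control, then BH-FDR correction."""
    labels, pvals = [], []
    for g in GROUPS[1:]:
        table = [
            [counts[g]["imaging"], n_runs - counts[g]["imaging"]],
            [counts["control"]["imaging"], n_runs - counts["control"]["imaging"]],
        ]
        _, p = fisher_exact(table)
        labels.append(g)
        pvals.append(p)
    rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return {g: {"p_adj": pa, "significant": rej} for g, pa, rej in zip(labels, p_adj, rejected)}
```

At the scale the abstract reports (9 models, 1,000 cases, 32 variants each, with what appears to be repeated sampling per prompt to reach over 1.7 million outputs), many group-by-recommendation comparisons are tested simultaneously, which is why the persistence of the key differences after multiple-hypothesis correction is a central part of the finding.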