Omar Mahmud, Soffer Shelly, Agbareia Reem, Bragazzi Nicola Luigi, Glicksberg Benjamin S, Hurd Yasmin L, Apakama Donald U, Charney Alexander W, Reich David L, Nadkarni Girish N, Klang Eyal
The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA.
The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA.
medRxiv. 2025 Mar 5:2025.03.04.25323396. doi: 10.1101/2025.03.04.25323396.
Large language models (LLMs) offer potential benefits in clinical care. However, concerns remain regarding socio-demographic biases embedded in their outputs. Opioid prescribing is one domain in which these biases can have serious implications, especially given the ongoing opioid epidemic and the need to balance effective pain management with addiction risk. We tested ten LLMs, both open-access and closed-source, on 1,000 acute-pain vignettes. Half of the vignettes were labeled as non-cancer and half as cancer. Each vignette was presented in 34 socio-demographic variations, including a control group without demographic identifiers. We analyzed model outputs covering opioid recommendations, anxiety treatment, perceived psychological stress, risk scores, and monitoring recommendations, yielding 3.4 million model-generated responses in total. Using logistic and linear mixed-effects models, we measured how these outputs varied by demographic group and whether a cancer diagnosis intensified or reduced the observed disparities. Across both cancer and non-cancer cases, historically marginalized groups, especially cases labeled as individuals who are unhoused, Black, or identifying as LGBTQIA+, often received more or stronger opioid recommendations, sometimes exceeding 90% in cancer settings, despite being labeled as high risk by the same models. Meanwhile, low-income or unemployed groups were assigned elevated risk scores yet received fewer opioid recommendations, suggesting inconsistent rationales. Disparities in anxiety treatment and perceived psychological stress similarly clustered within marginalized populations, even when clinical details were identical. These patterns diverged from standard guidelines, pointing to model-driven bias rather than acceptable clinical variation. Our findings underscore the need for rigorous bias evaluation and the integration of guideline-based checks in LLMs to ensure equitable, evidence-based pain care.
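The abstract describes fitting logistic and linear mixed-effects models to estimate how model outputs varied by demographic group and cancer status. Below is a minimal sketch, not the authors' actual analysis pipeline, of how such models could be specified in Python with statsmodels; the file name and column names (risk_score, opioid_recommended, demographic_group, cancer, vignette_id) are assumed for illustration.

```python
# Hedged sketch of mixed-effects analyses of the kind described in the abstract.
# Assumes a long-format table with one row per model-generated response.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("llm_opioid_responses.csv")  # hypothetical file name

# Linear mixed-effects model for a continuous outcome (e.g., risk score):
# fixed effects for demographic label, cancer status, and their interaction,
# with a random intercept per vignette.
lmm = smf.mixedlm(
    "risk_score ~ C(demographic_group, Treatment('control')) * cancer",
    data=df,
    groups=df["vignette_id"],
)
lmm_fit = lmm.fit()
print(lmm_fit.summary())

# For binary outcomes (e.g., opioid recommended yes/no), a logistic
# mixed-effects model is needed; one Python option is a Bayesian binomial
# mixed GLM with a vignette-level variance component.
glmm = BinomialBayesMixedGLM.from_formula(
    "opioid_recommended ~ C(demographic_group, Treatment('control')) * cancer",
    vc_formulas={"vignette": "0 + C(vignette_id)"},
    data=df,
)
glmm_fit = glmm.fit_vb()  # variational Bayes estimation
print(glmm_fit.summary())
```

Coefficients on the demographic terms (relative to the control vignettes without demographic identifiers) would then quantify group-level disparities, and the interaction terms whether a cancer label amplifies or attenuates them.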