Young Cameron C, Enichen Elizabeth, Rao Arya, Succi Marc D
Harvard Medical School, Boston, MA, United States.
Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States.
Pain. 2025 Mar 1;166(3):511-517. doi: 10.1097/j.pain.0000000000003388. Epub 2024 Sep 6.
Understanding how large language model (LLM) recommendations vary with patient race/ethnicity provides insight into how LLMs may counter or compound bias in opioid prescription. Forty real-world patient cases were sourced from the MIMIC-IV Note dataset with chief complaints of abdominal pain, back pain, headache, or musculoskeletal pain and amended to include all combinations of race/ethnicity and sex. Large language models were instructed to provide a subjective pain rating and a comprehensive pain management recommendation. Univariate analyses were performed to evaluate the association between racial/ethnic group or sex and the specified outcome measures (subjective pain rating, opioid name, order, and dosage recommendations) suggested by 2 LLMs (GPT-4 and Gemini). Four hundred eighty real-world patient cases were provided to each LLM, and responses included pharmacologic and nonpharmacologic interventions. Tramadol was the most frequently recommended weak opioid (55.4% of cases), while oxycodone was the most frequently recommended strong opioid (33.2% of cases). Relative to GPT-4, Gemini was more likely to rate a patient's pain as "severe" (OR: 0.57, 95% CI: [0.54, 0.60]; P < 0.001), recommend strong opioids (OR: 2.05, 95% CI: [1.59, 2.66]; P < 0.001), and recommend opioids later (OR: 1.41, 95% CI: [1.22, 1.62]; P < 0.001). Race/ethnicity and sex did not influence LLM recommendations. This study suggests that LLMs do not preferentially recommend opioid treatment for one group over another. Given that prior research shows race-based disparities in pain perception and treatment by healthcare providers, LLMs may offer physicians a helpful tool to guide pain management and ensure equitable treatment across patient groups.
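The following is a minimal sketch, not the authors' code, of the kind of univariate analysis described above: a logistic regression of a binary outcome (here, whether a strong opioid was recommended) on a single predictor, with the coefficient exponentiated to an odds ratio and 95% CI. The file name llm_pain_responses.csv and its columns (model, race_ethnicity, sex, strong_opioid) are hypothetical placeholders for one row per case-model response.

```python
# Sketch of a univariate odds-ratio analysis; data layout is assumed, not from the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per (case, model) response.
df = pd.read_csv("llm_pain_responses.csv")  # columns: model, race_ethnicity, sex, strong_opioid (0/1)

# Univariate logistic regression: does the LLM (Gemini vs GPT-4 as reference)
# predict whether a strong opioid was recommended?
fit = smf.logit(
    "strong_opioid ~ C(model, Treatment(reference='GPT-4'))", data=df
).fit(disp=False)

or_est = np.exp(fit.params)      # exponentiated coefficients = odds ratios
or_ci = np.exp(fit.conf_int())   # 95% confidence intervals on the OR scale
summary = pd.concat(
    [or_est.rename("OR"), or_ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1
)
print(summary)
print(fit.pvalues)
```

The same pattern would apply to the other reported comparisons (e.g., race/ethnicity or sex as the single predictor, or ordinal outcomes such as pain severity handled with an ordered model instead of a binary logit).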