Chang Crystal T, Srivathsa Neha, Bou-Khalil Charbel, Swaminathan Akshay, Lunn Mitchell R, Mishra Kavita, Koyejo Sanmi, Daneshjou Roxana
Department of Dermatology, Stanford University, Stanford, California, United States of America.
Department of Computer Science, Stanford University, Stanford, California, United States of America.
PLOS Digit Health. 2025 Sep 8;4(9):e0001001. doi: 10.1371/journal.pdig.0001001. eCollection 2025 Sep.
Large Language Models (LLMs) are increasingly deployed in clinical settings for tasks ranging from patient communication to decision support. While these models have been shown to exhibit race-based and binary gender biases, anti-LGBTQIA+ bias remains understudied despite documented healthcare disparities affecting these populations. In this work, we evaluated the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Stanford Medicine Secure GPT [GPT-4.0]) with 38 prompts consisting of explicit questions and synthetic clinical notes created by medically-trained reviewers and LGBTQIA+ health experts. The prompt set consisted of paired prompts with and without LGBTQIA+ identity terms and explored clinical situations across two axes: (i) situations where historical bias has been observed versus not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care versus not relevant. Medically-trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We found that all 4 LLMs generated inappropriate responses for prompts both with and without LGBTQIA+ identity terms. The proportion of inappropriate responses ranged from 43% to 62% for prompts mentioning LGBTQIA+ identities versus 47% to 65% for those without. The most common reason for an inappropriate classification tended to be hallucination/accuracy, followed by bias or safety. Qualitatively, we observed differential bias patterns, with LGBTQIA+ prompts eliciting more severe bias. The average clinical utility score for inappropriate responses was lower than for appropriate responses (2.6 versus 3.7 on a 5-point Likert scale). Future work should focus on tailoring output formats to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients. We present our prompts and annotated responses as a benchmark for evaluating future models. Content warning: This paper includes prompts and model-generated responses that may be offensive.
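For readers reproducing a similar analysis, the following is a minimal sketch (not the authors' code) of how paired-prompt reviewer ratings could be aggregated into the kinds of summary statistics reported above: per-model proportions of inappropriate responses, split by whether the prompt mentioned an LGBTQIA+ identity term, and mean clinical utility on a 1-5 Likert scale. The record fields and example values are illustrative assumptions only.

```python
# Sketch of aggregating reviewer ratings of LLM responses.
# Field names and example records are illustrative, not the study's data.
from statistics import mean

# Each record: one reviewer rating of one model response.
ratings = [
    {"model": "GPT-4o", "identity_term": True,  "appropriate": False, "utility": 2},
    {"model": "GPT-4o", "identity_term": False, "appropriate": True,  "utility": 4},
    {"model": "Claude 3 Haiku", "identity_term": True, "appropriate": True, "utility": 4},
    # ... remaining annotated responses ...
]

def proportion_inappropriate(records):
    """Fraction of responses judged inappropriate (any safety, privacy,
    hallucination/accuracy, or bias concern flagged by reviewers)."""
    return sum(not r["appropriate"] for r in records) / len(records)

# Proportion of inappropriate responses per model, with vs. without identity terms.
for model in sorted({r["model"] for r in ratings}):
    for has_term in (True, False):
        subset = [r for r in ratings
                  if r["model"] == model and r["identity_term"] == has_term]
        if subset:
            label = "identity term" if has_term else "no identity term"
            print(f"{model} ({label}): {proportion_inappropriate(subset):.0%} inappropriate")

# Mean clinical utility (1-5 Likert) for appropriate vs. inappropriate responses.
for label, keep in (("appropriate", True), ("inappropriate", False)):
    scores = [r["utility"] for r in ratings if r["appropriate"] == keep]
    if scores:
        print(f"{label} responses, mean utility: {mean(scores):.1f}")
```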