Pfohl Stephen R, Cole-Lewis Heather, Sayres Rory, Neal Darlene, Asiedu Mercy, Dieng Awa, Tomasev Nenad, Rashid Qazi Mamunur, Azizi Shekoofeh, Rostamzadeh Negar, McCoy Liam G, Celi Leo Anthony, Liu Yun, Schaekermann Mike, Walton Alanna, Parrish Alicia, Nagpal Chirag, Singh Preeti, Dewitt Akeiylah, Mansfield Philip, Prakash Sushant, Heller Katherine, Karthikesalingam Alan, Semturs Christopher, Barral Joelle, Corrado Greg, Matias Yossi, Smith-Loud Jamila, Horn Ivor, Singhal Karan
Google Research, Mountain View, CA, USA.
Google DeepMind, Mountain View, CA, USA.
Nat Med. 2024 Dec;30(12):3590-3600. doi: 10.1038/s41591-024-03258-2. Epub 2024 Sep 23.
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
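The abstract describes the evaluation approach only at a high level. As an illustrative sketch (not the paper's code or data), the snippet below shows one plausible way rubric-based human assessments of LLM answers could be aggregated across evaluation datasets and rater groups, so that biases flagged by one group of raters but missed by another become visible. All field names, dataset labels, and rater groups here are hypothetical placeholders.

```python
# Hypothetical sketch: aggregate human bias ratings per (dataset, rater group).
# None of the names below come from the paper; they illustrate the idea that
# narrower evaluations (a single rater group or dataset) can miss reported biases.
from collections import defaultdict

# Each record is one rubric-based human assessment of an LLM-generated answer.
ratings = [
    {"dataset": "adversarial_set_A", "rater_group": "physician", "bias_reported": False},
    {"dataset": "adversarial_set_A", "rater_group": "health_equity_expert", "bias_reported": True},
    {"dataset": "adversarial_set_A", "rater_group": "consumer", "bias_reported": False},
    {"dataset": "adversarial_set_B", "rater_group": "physician", "bias_reported": True},
    {"dataset": "adversarial_set_B", "rater_group": "consumer", "bias_reported": True},
]

def bias_report_rates(records):
    """Return the fraction of answers flagged for bias, keyed by (dataset, rater_group)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [num_flagged, num_rated]
    for r in records:
        key = (r["dataset"], r["rater_group"])
        counts[key][0] += int(r["bias_reported"])
        counts[key][1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}

for (dataset, group), rate in sorted(bias_report_rates(ratings).items()):
    print(f"{dataset:<20} {group:<22} bias-report rate = {rate:.2f}")
```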