Bai Xuechunzi, Wang Angelina, Sucholutsky Ilia, Griffiths Thomas L
Department of Psychology, The University of Chicago, Chicago, IL 60637.
Department of Computer Science, Stanford University, Palo Alto, CA 94305.
Proc Natl Acad Sci U S A. 2025 Feb 25;122(8):e2416228122. doi: 10.1073/pnas.2416228122. Epub 2025 Feb 20.
Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: LLM Word Association Test, a prompt-based method for revealing implicit bias; and LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: LLM Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Relative Decision Test operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). These prompt-based measures draw from psychology's long history of research into measuring stereotypes based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.
大语言模型(LLMs)能够通过明确的社会偏见测试,但仍存在隐性偏见,这与那些认同平等主义信念却表现出微妙偏见的人类相似。测量此类隐性偏见可能是一项挑战:随着大语言模型越来越具有专有性,可能无法获取其嵌入向量并应用现有的偏见测量方法;此外,只有当隐性偏见影响这些系统做出的实际决策时,才会成为主要问题。我们通过引入两种测量方法来应对这两个挑战:大语言模型词汇联想测试,一种基于提示揭示隐性偏见的方法;以及大语言模型相对决策测试,一种检测情境决策中微妙歧视的策略。这两种测量方法均基于心理学研究:大语言模型词汇联想测试改编自广泛用于研究人类头脑中概念之间自动联想的内隐联想测试;大语言模型相对决策测试将心理学研究结果付诸实践,即两个候选人之间的相对评估,而非对每个候选人独立进行的绝对评估,更能诊断隐性偏见。使用这些测量方法,我们发现在21种刻板印象(如种族与犯罪、种族与武器、性别与科学、年龄与负面评价)中的4个社会类别(种族、性别、宗教、健康)的8个价值对齐模型中,存在与社会中普遍存在的刻板印象偏见相似的情况。这些基于提示的测量方法借鉴了心理学长期以来基于纯粹可观察行为来测量刻板印象的研究历史;它们揭示了在专有价值对齐的大语言模型中,根据标准基准看似无偏见但实则存在细微差别偏见。