Hu Tiancheng, Kyrychenko Yara, Rathje Steve, Collier Nigel, van der Linden Sander, Roozenbeek Jon
Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK.
Department of Psychology, University of Cambridge, Cambridge, UK.
Nat Comput Sci. 2025 Jan;5(1):65-75. doi: 10.1038/s43588-024-00741-1. Epub 2024 Dec 12.
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity) and derogate other groups (outgroup hostility), are deeply rooted in human psychology and social behavior. However, it is unknown if such biases are also present in artificial intelligence systems. Here we show that large language models (LLMs) exhibit patterns of social identity bias, similarly to humans. By administering sentence completion prompts to 77 different LLMs (for instance, 'We are…'), we demonstrate that nearly all base models and some instruction-tuned and preference-tuned models display clear ingroup favoritism and outgroup derogation. These biases manifest both in controlled experimental settings and in naturalistic human-LLM conversations. However, we find that careful curation of training data and specialized fine-tuning can substantially reduce bias levels. These findings have important implications for developing more equitable artificial intelligence systems and highlight the urgent need to understand how human-LLM interactions might reinforce existing social biases.
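A minimal sketch of how such a sentence-completion probe could be run with the Hugging Face transformers library. The model names, sampling settings, and the use of an off-the-shelf sentiment classifier are illustrative assumptions only; they are not the authors' actual models or scoring pipeline.

```python
# Sketch (not the paper's pipeline): elicit "We are…" / "They are…" completions
# from a base LLM and score their sentiment as a rough proxy for ingroup
# solidarity vs. outgroup hostility. All model names are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in base LLM
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in classifier
)

prompts = {"ingroup": "We are", "outgroup": "They are"}

for group, prompt in prompts.items():
    completions = generator(
        prompt, max_new_tokens=20, num_return_sequences=5, do_sample=True
    )
    for c in completions:
        text = c["generated_text"]
        label = sentiment(text)[0]
        print(f"{group:8s} | {label['label']:8s} ({label['score']:.2f}) | {text!r}")

# Aggregating label frequencies per group gives a crude estimate of ingroup
# favoritism (positive "We are" completions) and outgroup derogation
# (negative "They are" completions).
```

Comparing the proportion of positive "We are" completions with the proportion of negative "They are" completions, per model, is one simple way to quantify the asymmetry the abstract describes.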