Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation.

Authors

An Jiafu, Huang Difang, Lin Chen, Tai Mingzhu

Affiliations

Department of Real Estate and Construction, University of Hong Kong, Hong Kong SAR 999077, China.

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.

Publication Information

PNAS Nexus. 2025 Mar 12;4(3):pgaf089. doi: 10.1093/pnasnexus/pgaf089. eCollection 2025 Mar.

DOI: 10.1093/pnasnexus/pgaf089
PMID: 40144775
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11937954/
Abstract

In traditional decision-making processes, social biases of human decision makers can lead to unequal economic outcomes for underrepresented social groups, such as women and racial/ethnic minorities (1-4). Recently, the growing popularity of large language model (LLM)-based AI signals a potential shift from human to AI-based decision-making. How would this transition affect the distributional outcomes across social groups? Here, we investigate the gender and racial biases of a number of commonly used LLMs, including OpenAI's GPT-3.5 Turbo and GPT-4o, Google's Gemini 1.5 Flash, Anthropic AI's Claude 3.5 Sonnet, and Meta's Llama 3-70b, in a high-stakes decision-making setting of assessing entry-level job candidates from diverse social groups. Instructing the models to score ∼361,000 resumes with randomized social identities, we find that the LLMs award higher assessment scores for female candidates with similar work experience, education, and skills, but lower scores for black male candidates with comparable qualifications. These biases may result in ∼1-3 percentage-point differences in hiring probabilities for otherwise similar candidates at a certain threshold and are consistent across various job positions and subsamples. Our results indicate that LLM-based AI systems demonstrate significant biases, varying in terms of the directions and magnitudes across different social groups. Further research is needed to comprehend the root causes of these outcomes and develop strategies to minimize the remaining biases in AI systems. As AI-based decision-making tools are increasingly employed across diverse domains, our findings underscore the necessity of understanding and addressing the potential unequal outcomes to ensure equitable outcomes across social groups.
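The audit design described in the abstract lends itself to a simple illustrative sketch: pair identical resume bodies with randomized identity signals, collect model scores, and compare mean scores and threshold-based hiring rates across groups. The snippet below is a minimal sketch of that general approach, not the authors' protocol: the candidate names, prompt wording, 0-100 scale, and hiring threshold are all our assumptions, and the LLM call is stubbed with simulated scores rather than wired to any vendor API.

```python
"""Minimal sketch of a randomized-identity resume audit, as described in the
abstract. Names, prompt wording, scale, threshold, and the simulated scores
are illustrative assumptions, not the authors' exact protocol."""

import itertools
import random
import statistics

# Hypothetical name-based identity signals, in the style of classic audit
# studies; these specific names are our assumption.
IDENTITIES = {
    ("female", "white"): "Emily Walsh",
    ("male", "white"): "Greg Baker",
    ("female", "black"): "Lakisha Washington",
    ("male", "black"): "Jamal Jefferson",
}


def build_prompt(name: str, resume_body: str) -> str:
    # The same resume body is paired with every identity, so any score gap
    # is attributable to the identity signal alone.
    return (
        "Rate the following entry-level candidate from 0 to 100. "
        "Respond with a single number.\n\n"
        f"Name: {name}\n{resume_body}"
    )


def score_resume(prompt: str) -> float:
    # Stub for an LLM call (e.g. a GPT-4o or Claude 3.5 Sonnet client).
    # Simulated here so the sketch runs without API access.
    return random.gauss(70, 5)


def run_audit(resumes: list[str], threshold: float = 75.0) -> None:
    scores: dict[tuple[str, str], list[float]] = {g: [] for g in IDENTITIES}
    for body, (group, name) in itertools.product(resumes, IDENTITIES.items()):
        scores[group].append(score_resume(build_prompt(name, body)))
    for group, vals in scores.items():
        # Hiring probability at a fixed score threshold; between-group gaps
        # in this rate are the percentage-point differences of interest.
        hire_rate = sum(v >= threshold for v in vals) / len(vals)
        print(f"{group}: mean score {statistics.mean(vals):.1f}, "
              f"hire rate at {threshold:.0f}+: {hire_rate:.1%}")


if __name__ == "__main__":
    random.seed(0)
    run_audit(["<synthetic resume body>"] * 100)
```

In a faithful replication, score_resume would call each evaluated model in turn; the ∼1-3 percentage-point hiring-probability gaps reported in the abstract correspond to between-group differences in this kind of threshold-based hire rate.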


Figures (PMC):
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cee/11937954/1b031d780666/pgaf089f1.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cee/11937954/cc5432abae80/pgaf089f2.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cee/11937954/1112d593f532/pgaf089f3.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cee/11937954/ee023fa1915b/pgaf089f4.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1cee/11937954/429104c6cc80/pgaf089f5.jpg

Similar Articles

1. Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation.
PNAS Nexus. 2025 Mar 12;4(3):pgaf089. doi: 10.1093/pnasnexus/pgaf089. eCollection 2025 Mar.
2. Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports.
Radiology. 2025 Jan;314(1):e240895. doi: 10.1148/radiol.240895.
3. Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis.
BMJ. 2024 Mar 20;384:e078538. doi: 10.1136/bmj-2023-078538.
4. Programming Chatbots Using Natural Language: Generating Cervical Spine MRI Impressions.
Cureus. 2024 Sep 14;16(9):e69410. doi: 10.7759/cureus.69410. eCollection 2024 Sep.
5. Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology.
NPJ Digit Med. 2025 May 3;8(1):242. doi: 10.1038/s41746-025-01589-z.
6. Fairness in AI-Driven Oncology: Investigating Racial and Gender Biases in Large Language Models.
Cureus. 2024 Sep 16;16(9):e69541. doi: 10.7759/cureus.69541. eCollection 2024 Sep.
7. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.
JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
8. Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes.
JMIR Ment Health. 2025 Mar 20;12:e57986. doi: 10.2196/57986.
9. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
Cureus. 2024 Mar 11;16(3):e55991. doi: 10.7759/cureus.55991. eCollection 2024 Mar.
10. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values.
JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.

References Cited in This Article

1. Explicitly unbiased large language models still form biased associations.
Proc Natl Acad Sci U S A. 2025 Feb 25;122(8):e2416228122. doi: 10.1073/pnas.2416228122. Epub 2025 Feb 20.
2. Large Language Models based on historical text could offer informative tools for behavioral science.
Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2407639121. doi: 10.1073/pnas.2407639121. Epub 2024 Oct 9.
3. Cultural bias and cultural alignment of large language models.
PNAS Nexus. 2024 Sep 17;3(9):pgae346. doi: 10.1093/pnasnexus/pgae346. eCollection 2024 Sep.
4. The political preferences of LLMs.
PLoS One. 2024 Jul 31;19(7):e0306621. doi: 10.1371/journal.pone.0306621. eCollection 2024.
5. Perils and opportunities in using large language models in psychological research.
PNAS Nexus. 2024 Jul 16;3(7):pgae245. doi: 10.1093/pnasnexus/pgae245. eCollection 2024 Jul.
6. The impact of generative artificial intelligence on socioeconomic inequalities and policy making.
PNAS Nexus. 2024 Jun 11;3(6):pgae191. doi: 10.1093/pnasnexus/pgae191. eCollection 2024 Jun.
7. Can Generative AI improve social science?
Proc Natl Acad Sci U S A. 2024 May 21;121(21):e2314021121. doi: 10.1073/pnas.2314021121. Epub 2024 May 9.
8. The emergence of economic rationality of GPT.
Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2316205120. doi: 10.1073/pnas.2316205120. Epub 2023 Dec 12.
9. How robots can learn to follow a moral code.
Nature. 2023 Oct 26. doi: 10.1038/d41586-023-03258-1.
10. Word embeddings quantify 100 years of gender and ethnic stereotypes.
Proc Natl Acad Sci U S A. 2018 Apr 17;115(16):E3635-E3644. doi: 10.1073/pnas.1720347115. Epub 2018 Apr 3.