

Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.

Authors

Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe

Affiliations

Department of Computing Science, Umeå University, Umeå, 90187 Sweden.

Department of Computing Science, Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 Amsterdam, Netherlands.

Publication

Ethics Inf Technol. 2025;27(2):28. doi: 10.1007/s10676-025-09837-2. Epub 2025 Jun 4.

DOI: 10.1007/s10676-025-09837-2
PMID: 40486676
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12137480/
Abstract

This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLHF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics, and contributing to AI safety. We highlight tensions inherent in the goals of RLHF, as captured in the HHH principle (helpful, harmless and honest). In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLHF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We offer an alternative vision for AI safety and ethics which positions RLHF approaches within a broader context of comprehensive design across institutions, processes and technological systems, and suggest the establishment of AI safety as a sociotechnical discipline that is open to the normative and political dimensions of artificial intelligence.


Similar articles

1. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.
   Ethics Inf Technol. 2025;27(2):28. doi: 10.1007/s10676-025-09837-2. Epub 2025 Jun 4.
2. Aligning large language models with radiologists by reinforcement learning from AI feedback for chest CT reports.
   Eur J Radiol. 2025 Mar;184:111984. doi: 10.1016/j.ejrad.2025.111984. Epub 2025 Feb 6.
3. Utilizing large language models for gastroenterology research: a conceptual framework.
   Therap Adv Gastroenterol. 2025 Apr 1;18:17562848251328577. doi: 10.1177/17562848251328577. eCollection 2025.
4. A framework for mitigating malicious RLHF feedback in LLM training using consensus based reward.
   Sci Rep. 2025 Mar 17;15(1):9177. doi: 10.1038/s41598-025-92889-7.
5. A novel voice in head actor critic reinforcement learning with human feedback framework for enhanced robot navigation.
   Sci Rep. 2025 Feb 28;15(1):7237. doi: 10.1038/s41598-025-92252-w.
6. STELA: a community-centred approach to norm elicitation for AI alignment.
   Sci Rep. 2024 Mar 19;14(1):6616. doi: 10.1038/s41598-024-56648-4.
7. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration.
   Healthcare (Basel). 2023 Oct 20;11(20):2776. doi: 10.3390/healthcare11202776.
8. Reflections on Putting AI Ethics into Practice: How Three AI Ethics Approaches Conceptualize Theory and Practice.
   Sci Eng Ethics. 2023 May 26;29(3):21. doi: 10.1007/s11948-023-00443-3.
9. Bridging Artificial Intelligence and Medical Education: Navigating the Alignment Paradox.
   ATS Sch. 2025 Jun;6(2):135-148. doi: 10.34197/ats-scholar.2024-0086PS. Epub 2025 Mar 20.
10. Ethical Artificial Intelligence in Nursing Workforce Management and Policymaking: Bridging Philosophy and Practice.
    J Nurs Manag. 2025 Apr 8;2025:7954013. doi: 10.1155/jonm/7954013. eCollection 2025.
