
Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.

Authors

Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe

Affiliations

Department of Computing Science, Umeå University, Umeå, 90187 Sweden.

Department of Computing Science, Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 Amsterdam, Netherlands.

Publication

Ethics Inf Technol. 2025;27(2):28. doi: 10.1007/s10676-025-09837-2. Epub 2025 Jun 4.

Abstract

This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLHF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and to contributing to AI safety. We highlight tensions inherent in the goals of RLHF, as captured in the HHH principle (helpful, harmless and honest). In addition, we discuss ethically relevant issues that tend to be neglected in discussions about alignment and RLHF, among which are the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We offer an alternative vision for AI safety and ethics which positions RLHF approaches within a broader context of comprehensive design across institutions, processes and technological systems, and suggest the establishment of AI safety as a sociotechnical discipline that is open to the normative and political dimensions of artificial intelligence.
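
For context on the technique under critique, the following is a minimal illustrative sketch (in Python) of the reward-modelling step used by generic RLHF pipelines: a Bradley-Terry-style pairwise preference loss fitted to human comparisons of model outputs. It is not drawn from the paper itself, and the names reward_model, chosen, and rejected are hypothetical placeholders.

# Minimal sketch of reward-model training in a generic RLHF pipeline.
# Assumes PyTorch; reward_model is any callable mapping a batch of encoded
# responses to scalar rewards. All names here are illustrative placeholders.
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Pairwise preference loss: push rewards of human-preferred responses above rejected ones."""
    r_chosen = reward_model(chosen)        # rewards for preferred responses
    r_rejected = reward_model(rejected)    # rewards for dispreferred responses
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# In a full RLHF pipeline, the fitted reward model then guides RL fine-tuning of
# the language model (commonly PPO with a KL penalty toward the base model);
# RLAIF replaces the human comparisons with preferences generated by another model.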


Similar articles

Utilizing large language models for gastroenterology research: a conceptual framework.
Therap Adv Gastroenterol. 2025 Apr 1;18:17562848251328577. doi: 10.1177/17562848251328577. eCollection 2025.

