Institute of Intelligent Systems and Robotics, Sorbonne University/CNRS, 75005, Paris, France.
Sci Rep. 2024 Aug 21;14(1):19399. doi: 10.1038/s41598-024-70031-3.
Minimizing negative impacts of Artificial Intelligent (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what it means and is required for alignment to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", in extension of John Searle's famous proposal. We finally mention current promising research directions towards a weak alignment, which could produce statistically satisfying answers in a number of common situations, however so far without ensuring any truth value.
在没有人为监督的情况下,最小化人工智能 (AI) 系统对人类社会的负面影响,需要它们能够与人类价值观保持一致。然而,目前大多数工作仅从技术角度来解决这个问题,例如改进依赖于人类反馈的强化学习等现有方法,而忽略了对齐发生所需的含义和要求。在这里,我们建议区分强对齐和弱对齐。强对齐需要认知能力(与人类相似或与人类不同),例如理解和推理代理的意图及其产生所需效果的能力。我们认为,对于像大型语言模型 (LLM) 这样的 AI 系统来说,这是识别可能违反人类价值观的情况所需的。为了说明这种区别,我们提出了一系列提示,展示了 ChatGPT、Gemini 和 Copilot 无法识别其中一些情况。此外,我们还分析了词嵌入,以表明在 LLM 中,一些人类价值观的最近邻与人类的语义表示不同。然后,我们提出了一个新的思想实验,我们称之为“带有单词转换字典的中文房间”,这是对 John Searle 著名提议的扩展。最后,我们提到了目前朝着弱对齐方向发展的有前景的研究方向,这些方向在许多常见情况下可以产生统计上令人满意的答案,但到目前为止,并没有保证任何真值。