
Aversion to external feedback suffices to ensure agent alignment.

Authors

Garcia Paulo

Affiliation

International School of Engineering, Chulalongkorn University, Bangkok, Thailand.

Publication

Sci Rep. 2024 Sep 10;14(1):21147. doi: 10.1038/s41598-024-72072-0.

Abstract

Ensuring that artificial intelligence behaves in ways aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, acting so as to maximize a utility function, will inevitably behave in ways misaligned with human values, especially as their level of intelligence rises. Prior work has also shown that there is no "one true utility function"; solutions must take a more holistic approach to alignment. This paper describes apprehensive agents: agents architected so that their effective utility function aggregates a partial utility function (built by designers, to be maximized) with an expectation of negative feedback on given states (reasoned about by the agent, to be minimized). These agents are also capable of a temporal reasoning process that approximates designers' intentions as a function of environment evolution (a necessary feature for severe misalignment to occur). We show that an apprehensive agent, behaving rationally, leverages this internal approximation of designers' intentions to predict negative feedback and, as a consequence, behaves in ways that maximize alignment without actually receiving any external feedback. We evaluate this strategy in simulated environments that expose misalignment opportunities: apprehensive agents are indeed better aligned than their base counterparts and, in contrast with extant techniques, their chances of alignment actually improve as agent intelligence grows.
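The decision rule the abstract describes can be sketched as follows: the agent scores each candidate action by the designer-built partial utility of the resulting state, minus its own prediction of negative feedback on that state. This is a minimal illustrative sketch, not the paper's implementation; the names (`partial_utility`, `expected_negative_feedback`) and the apprehension weight `lam` are assumptions introduced here.

```python
def choose_action(state, actions, transition, partial_utility,
                  expected_negative_feedback, lam=1.0):
    """Pick the action maximizing the effective utility of the next state:
    partial utility (to be maximized) minus the agent's own expectation
    of designer disapproval (to be minimized), weighted by `lam`."""
    def effective_utility(next_state):
        return (partial_utility(next_state)
                - lam * expected_negative_feedback(next_state))
    return max(actions, key=lambda a: effective_utility(transition(state, a)))


# Toy usage: the 'risky' state scores highest on raw utility, but the
# agent predicts strong negative feedback there, so it picks 'safe'.
transition = lambda s, a: a              # in this toy, actions name the next state
utility = {'risky': 10.0, 'safe': 6.0}   # partial utility per state
feedback = {'risky': 8.0, 'safe': 0.0}   # predicted negative feedback per state
best = choose_action('start', ['risky', 'safe'],
                     transition, utility.get, feedback.get)
```

In the toy run, 'risky' nets 10.0 − 8.0 = 2.0 while 'safe' nets 6.0 − 0.0 = 6.0, so the apprehensive rule selects 'safe' even though a plain utility maximizer would select 'risky'.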


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/ca7deb992a1e/41598_2024_72072_Fig1_HTML.jpg
