
Aversion to external feedback suffices to ensure agent alignment.

Author information

Garcia Paulo

Affiliation

International School of Engineering, Chulalongkorn University, Bangkok, Thailand.

Publication information

Sci Rep. 2024 Sep 10;14(1):21147. doi: 10.1038/s41598-024-72072-0.

DOI: 10.1038/s41598-024-72072-0
PMID: 39256454
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11387646/
Abstract

Ensuring artificial intelligence behaves in such a way that is aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, behaving in such a way that maximizes a utility function, will inevitably behave in such a way that is not aligned with human values, especially as their level of intelligence goes up. Prior work has also shown that there is no "one true utility function"; solutions must include a more holistic approach to alignment. This paper describes apprehensive agents: agents that are architected in such a way that their effective utility function is an aggregation of a partial utility function (built by designers, to be maximized) and an expectation of negative feedback on given states (reasoned about, to be minimized). Agents are also capable of performing a temporal reasoning process that approximates designers' intentions in function of environment evolution (a necessary feature for severe mis-alignment to occur). We show that an apprehensive agent, behaving rationally, leverages this internal approximation of designers' intentions to predict negative feedback, and, as a consequence, behaves in such a way that maximizes alignment, without actually receiving any external feedback. We evaluate this strategy on simulated environments that expose mis-alignment opportunities: we show that apprehensive agents are indeed better aligned than their base counterparts and, in contrast with extant techniques, chances of alignment actually improve as agent intelligence grows.
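
The aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state how the two terms are combined, so the linear form, the `apprehension` weight, the class and function names, and the toy states and numbers below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable

State = Hashable


@dataclass(frozen=True)
class ApprehensiveAgent:
    # Designer-built partial utility: the term the agent maximizes.
    partial_utility: Callable[[State], float]
    # The agent's own prediction of negative feedback for a state,
    # derived from its internal model of designers' intentions.
    # No external feedback is ever received.
    expected_negative_feedback: Callable[[State], float]
    # Hypothetical weight; 0.0 recovers the plain utility maximizer.
    apprehension: float = 1.0

    def effective_utility(self, state: State) -> float:
        # Assumed aggregation: partial utility minus weighted expected feedback.
        penalty = self.apprehension * self.expected_negative_feedback(state)
        return self.partial_utility(state) - penalty

    def choose(self, reachable: Iterable[State]) -> State:
        # Rational choice: pick the state with the highest effective utility.
        return max(reachable, key=self.effective_utility)


# Toy environment with a mis-alignment opportunity (made-up numbers).
utility = {"clean_room": 5.0, "hide_mess": 8.0}
feedback = {"clean_room": 0.0, "hide_mess": 6.0}

base = ApprehensiveAgent(utility.__getitem__, feedback.__getitem__, apprehension=0.0)
apprehensive = ApprehensiveAgent(utility.__getitem__, feedback.__getitem__)

base_choice = base.choose(utility)                  # maximizes raw utility
apprehensive_choice = apprehensive.choose(utility)  # avoids predicted feedback
```

In this toy setting the base agent picks the higher-utility but unintended state, while the apprehensive agent picks the state its model predicts designers would not object to, even though no feedback is actually delivered.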

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/ca7deb992a1e/41598_2024_72072_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/0d89681ff202/41598_2024_72072_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/1dc7c16f76dd/41598_2024_72072_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/4b601c468dd7/41598_2024_72072_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/d011bd244b2f/41598_2024_72072_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ca7/11387646/7394cebe247e/41598_2024_72072_Fig6_HTML.jpg

Similar articles

1
Aversion to external feedback suffices to ensure agent alignment.
Sci Rep. 2024 Sep 10;14(1):21147. doi: 10.1038/s41598-024-72072-0.
2
[Standard technical specifications for methacholine chloride (Methacholine) bronchial challenge test (2023)].
Zhonghua Jie He He Hu Xi Za Zhi. 2024 Feb 12;47(2):101-119. doi: 10.3760/cma.j.cn112147-20231019-00247.
3
Macromolecular crowding: chemistry and physics meet biology (Ascona, Switzerland, 10-14 June 2012).
Phys Biol. 2013 Aug;10(4):040301. doi: 10.1088/1478-3975/10/4/040301. Epub 2013 Aug 2.
4
Strong and weak alignment of large language models with human values.
Sci Rep. 2024 Aug 21;14(1):19399. doi: 10.1038/s41598-024-70031-3.
5
Industrial designers' thinking in the stage of concept generation for social design: themes, strategies and modes.
Int J Technol Des Educ. 2023;33(1):281-311. doi: 10.1007/s10798-022-09732-7. Epub 2022 Feb 10.
6
A functional contextual, observer-centric, quantum mechanical, and neuro-symbolic approach to solving the alignment problem of artificial general intelligence: safe AI through intersecting computational psychological neuroscience and LLM architecture for emergent theory of mind.
Front Comput Neurosci. 2024 Aug 8;18:1395901. doi: 10.3389/fncom.2024.1395901. eCollection 2024.
7
Transferable Virtual-Physical Environmental Alignment With Redirected Walking.
IEEE Trans Vis Comput Graph. 2024 Mar;30(3):1696-1709. doi: 10.1109/TVCG.2022.3224073. Epub 2024 Jan 30.
8
Planning Implications Related to Sterilization-Sensitive Science Investigations Associated with Mars Sample Return (MSR).
Astrobiology. 2022 Jun;22(S1):S112-S164. doi: 10.1089/AST.2021.0113. Epub 2022 May 19.
9
Computational Goals, Values and Decision-Making.
Sci Eng Ethics. 2020 Oct;26(5):2487-2495. doi: 10.1007/s11948-020-00244-y.
10
Behavioral regulation of the milieu interne in man and rat.
Science. 1974 Sep 6;185(4154):824-31. doi: 10.1126/science.185.4154.824.
