Ensuring that artificial intelligence behaves in a way that is aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, which act to maximize a utility function, will inevitably behave in ways that are not aligned with human values, especially as their level of intelligence increases. Prior work has also shown that there is no "one true utility function"; solutions must take a more holistic approach to alignment. This paper describes apprehensive agents: agents architected so that their effective utility function is an aggregation of a partial utility function (built by designers, to be maximized) and an expectation of negative feedback on given states (reasoned about, to be minimized). These agents are also capable of a temporal reasoning process that approximates designers' intentions as a function of how the environment evolves (a feature that is necessary for severe misalignment to occur). We show that an apprehensive agent, behaving rationally, leverages this internal approximation of designers' intentions to predict negative feedback and, as a consequence, behaves in a way that maximizes alignment without actually receiving any external feedback. We evaluate this strategy in simulated environments that expose misalignment opportunities: we show that apprehensive agents are indeed better aligned than their base counterparts and that, in contrast with extant techniques, the chances of alignment actually improve as agent intelligence grows.
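
To make the idea of an effective utility function concrete, the following is a minimal sketch, not the paper's implementation. The abstract only says the two terms are aggregated; the weighted difference used here, the `weight` parameter, the function names, and the toy states and numbers are all assumptions chosen for illustration.

```python
from typing import Callable, Hashable, Iterable

State = Hashable


def effective_utility(
    state: State,
    partial_utility: Callable[[State], float],
    expected_negative_feedback: Callable[[State], float],
    weight: float = 1.0,
) -> float:
    """Combine the designer-built utility (to be maximized) with the agent's
    own prediction of negative feedback (to be minimized).

    Assumption: the aggregation is a weighted difference. The source does not
    specify the aggregation operator.
    """
    return partial_utility(state) - weight * expected_negative_feedback(state)


def choose_state(
    candidates: Iterable[State],
    partial_utility: Callable[[State], float],
    expected_negative_feedback: Callable[[State], float],
    weight: float = 1.0,
) -> State:
    """Pick the candidate state with the highest effective utility."""
    return max(
        candidates,
        key=lambda s: effective_utility(
            s, partial_utility, expected_negative_feedback, weight
        ),
    )


# Hypothetical toy environment: a "shortcut" state scores well on the partial
# utility but is one the designers would object to; "intended" scores lower
# but draws no predicted objection.
PARTIAL_UTILITY = {"shortcut": 10.0, "intended": 7.0}
PREDICTED_FEEDBACK = {"shortcut": 8.0, "intended": 0.0}


def toy_partial_utility(state: State) -> float:
    return PARTIAL_UTILITY[state]


def toy_predicted_feedback(state: State) -> float:
    # Stands in for the agent's internal approximation of designers'
    # intentions; no external feedback is consulted.
    return PREDICTED_FEEDBACK[state]


CANDIDATES = ["shortcut", "intended"]

# weight=0.0 ignores predicted feedback, which models the base agent.
base_choice = choose_state(
    CANDIDATES, toy_partial_utility, toy_predicted_feedback, weight=0.0
)
apprehensive_choice = choose_state(
    CANDIDATES, toy_partial_utility, toy_predicted_feedback, weight=1.0
)
```

In this toy setting the base agent selects `"shortcut"` (utility 10 versus 7), while the apprehensive agent selects `"intended"` (effective utility 7 versus 2). The predicted-feedback term is computed entirely from the agent's own model, which mirrors the claim that no external feedback is actually received.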