Complex behavior from intrinsic motivation to occupy future action-state path space.

Authors

Ramírez-Ruiz Jorge, Grytskyy Dmytro, Mastrogiuseppe Chiara, Habib Yamen, Moreno-Bote Rubén

Affiliations

Center for Brain and Cognition, Departament d'Enginyeria i Escola d'Enginyeria, Universitat Pompeu Fabra, Barcelona, Spain.

Serra Húnter Fellow Programme, Universitat Pompeu Fabra, Barcelona, Spain.

Publication information

Nat Commun. 2024 Jul 29;15(1):6368. doi: 10.1038/s41467-024-49711-1.

Abstract

Most theories of behavior posit that agents tend to maximize some form of reward or utility. However, animals very often move with curiosity and seem to be motivated in a reward-free manner. Here we abandon the idea of reward maximization and propose that the goal of behavior is maximizing occupancy of future paths of actions and states. According to this maximum occupancy principle, rewards are the means to occupy path space, not the goal per se; goal-directedness simply emerges as rational ways of searching for resources so that movement, understood amply, never ends. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy and state-value function and prove convergence of our value iteration algorithm. Using discrete and continuous state tasks, including a high-dimensional controller, we show that complex behaviors such as "dancing", hide-and-seek, and a basic form of altruistic behavior naturally result from the intrinsic motivation to occupy path space. All in all, we present a theory of behavior that generates both variability and goal-directedness in the absence of reward maximization.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c969/11286966/d101b33cfe27/41467_2024_49711_Fig1_HTML.jpg
