Malekzadeh Parvin, Plataniotis Konstantinos N
Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, M5S 3G8, Canada
Neural Comput. 2024 Sep 17;36(10):2073-2135. doi: 10.1162/neco_a_01698.
Reinforcement learning (RL) has garnered significant attention for developing decision-making agents that aim to maximize rewards, specified by an external supervisor, within fully observable environments. However, many real-world problems involve partial or noisy observations, where agents cannot access complete and accurate information about the environment. These problems are commonly formulated as partially observable Markov decision processes (POMDPs). Previous studies have tackled RL in POMDPs by either incorporating the memory of past actions and observations or by inferring the true state of the environment from observed data. Nevertheless, aggregating observations and actions over time becomes impractical in problems with large decision-making time horizons and high-dimensional spaces. Furthermore, inference-based RL approaches often require many environmental samples to perform well, as they focus solely on reward maximization and neglect uncertainty in the inferred state. Active inference (AIF) is a framework naturally formulated in POMDPs that directs agents to select actions by minimizing a function called expected free energy (EFE). This supplements reward-maximizing (or exploitative) behavior, as in RL, with information-seeking (or exploratory) behavior. Despite this exploratory behavior of AIF, its use is limited to problems with small time horizons and discrete spaces due to the computational challenges associated with EFE. In this article, we propose a unified principle that establishes a theoretical connection between AIF and RL, enabling seamless integration of these two approaches and overcoming their limitations in continuous-space POMDP settings. We substantiate our findings with rigorous theoretical analysis, providing novel perspectives for using AIF in designing and implementing artificial agents. Experimental results demonstrate the superior learning capabilities of our method compared to alternative RL approaches in solving partially observable tasks with continuous spaces. Notably, our approach harnesses information-seeking exploration, enabling it to effectively solve reward-free problems and rendering explicit task reward design by an external supervisor optional.
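For context, the decomposition of EFE standard in the active-inference literature illustrates why minimizing it yields both exploitative and exploratory behavior; the exact parameterization used in this article may differ, so the following is a sketch rather than the authors' formulation:

$$
G(\pi) \;=\; \sum_{\tau}\;
\underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\ln P(o_\tau \mid C)\big]}_{\text{extrinsic (preference/reward) value}}
\;-\;
\underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\Big[ D_{\mathrm{KL}}\big( Q(s_\tau \mid o_\tau, \pi)\,\big\Vert\, Q(s_\tau \mid \pi)\big)\Big]}_{\text{epistemic (information-gain) value}},
$$

where $\pi$ is a policy, $Q(\cdot \mid \pi)$ denotes the agent's predictive beliefs over future states $s_\tau$ and observations $o_\tau$ under that policy, and $P(o_\tau \mid C)$ encodes prior preferences over observations (playing the role of reward in RL). Selecting the policy that minimizes $G(\pi)$ therefore maximizes both expected preference satisfaction and expected information gain; the second term is what purely reward-maximizing objectives lack and what enables reward-free operation.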