Wolfram Barfuss
School of Mathematics, University of Leeds, Leeds, UK.
Tübingen AI Center, University of Tübingen, Tübingen, Germany.
Neural Comput Appl. 2022;34(3):1653-1671. doi: 10.1007/s00521-021-06117-0. Epub 2021 Jun 23.
A dynamical systems perspective on multi-agent learning, based on the link between evolutionary game theory and reinforcement learning, provides an improved qualitative understanding of the emerging collective learning dynamics. However, there is confusion about how this dynamical systems account of multi-agent learning should be interpreted. In this article, I propose to embed the dynamical systems description of multi-agent learning into different abstraction levels of cognitive analysis. The purpose of this work is to make the connections between these levels explicit in order to gain improved insight into multi-agent learning. I demonstrate the usefulness of this framework with the general and widespread class of temporal-difference reinforcement learning. I find that its deterministic dynamical systems description follows a minimum free-energy principle and unifies a boundedly rational account of game theory with decision-making under uncertainty. I then propose an on-line sample-batch temporal-difference algorithm that combines a memory batch with separated state-action value estimation. I show that this algorithm serves as a micro-foundation of the deterministic learning equations: its learning trajectories approach those of the deterministic learning equations for large batch sizes. Ultimately, this framework of embedding a dynamical systems description into different abstraction levels gives guidance on how to unleash the full potential of the dynamical systems approach to multi-agent learning.
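The claim that sample-batch learning trajectories approach the deterministic learning equations as the batch size grows can be illustrated with a small Python sketch. It rests on strong simplifying assumptions and is not the paper's algorithm: a single-state (repeated) 2x2 game with hypothetical Prisoner's Dilemma payoffs, Boltzmann (softmax) action selection with inverse temperature `beta`, and a next-state value computed from each agent's own current policy and Q-values. The function names (`deterministic_step`, `sample_batch_step`, `trajectory`, `max_gap`), the payoffs, and the parameter values (`alpha`, `beta`, `gamma`, `batch_size`) are illustrative assumptions. `deterministic_step` applies the expected temporal-difference update to every action; `sample_batch_step` plays a batch of rounds with frozen policies and updates each action from its own sample mean.

```python
import math
import random

# Hypothetical 2x2 Prisoner's Dilemma payoffs (illustrative, not from the paper).
# PAYOFF[i][a0][a1] is agent i's reward when agent 0 plays a0 and agent 1 plays a1;
# action 0 = cooperate, action 1 = defect.
PAYOFF = [
    [[3.0, 0.0], [5.0, 1.0]],  # agent 0
    [[3.0, 5.0], [0.0, 1.0]],  # agent 1
]


def reward(i, a_own, a_other):
    """Reward of agent i given its own action and the other agent's action."""
    return PAYOFF[i][a_own][a_other] if i == 0 else PAYOFF[i][a_other][a_own]


def softmax(q, beta):
    """Boltzmann policy: x[a] proportional to exp(beta * q[a])."""
    m = max(q)  # subtract the max for numerical stability
    w = [math.exp(beta * (v - m)) for v in q]
    z = sum(w)
    return [v / z for v in w]


def deterministic_step(Q, alpha, beta, gamma):
    """Infinite-batch limit: every action is updated with its expected TD target."""
    X = [softmax(q, beta) for q in Q]
    new_Q = []
    for i in range(2):
        x_other = X[1 - i]
        # Single-state game: the next state is the current one, valued under the agent's own policy.
        next_value = sum(p * v for p, v in zip(X[i], Q[i]))
        row = []
        for a in range(2):
            exp_r = sum(x_other[b] * reward(i, a, b) for b in range(2))
            row.append(Q[i][a] + alpha * (exp_r + gamma * next_value - Q[i][a]))
        new_Q.append(row)
    return new_Q


def sample_batch_step(Q, alpha, beta, gamma, batch_size, rng):
    """Play batch_size rounds with frozen policies, then update each action from its own sample mean."""
    X = [softmax(q, beta) for q in Q]
    a0s = rng.choices(range(2), weights=X[0], k=batch_size)
    a1s = rng.choices(range(2), weights=X[1], k=batch_size)
    new_Q = []
    for i, (own, other) in enumerate([(a0s, a1s), (a1s, a0s)]):
        totals, counts = [0.0, 0.0], [0, 0]
        for a, b in zip(own, other):
            totals[a] += reward(i, a, b)
            counts[a] += 1
        # Assumption: only rewards are sampled; the next-state value uses the agent's current Q and policy.
        next_value = sum(p * v for p, v in zip(X[i], Q[i]))
        row = []
        for a in range(2):
            if counts[a] == 0:  # action never sampled in this batch: keep the old estimate
                row.append(Q[i][a])
                continue
            mean_r = totals[a] / counts[a]
            row.append(Q[i][a] + alpha * (mean_r + gamma * next_value - Q[i][a]))
        new_Q.append(row)
    return new_Q


def trajectory(step, Q0, steps, **kwargs):
    """Iterate a learning step and return the list of joint Q-tables visited."""
    Qs = [Q0]
    for _ in range(steps):
        Qs.append(step(Qs[-1], **kwargs))
    return Qs


def max_gap(Qa, Qb):
    """Largest absolute difference between two joint Q-tables."""
    return max(abs(u - v) for ra, rb in zip(Qa, Qb) for u, v in zip(ra, rb))


if __name__ == "__main__":
    params = dict(alpha=0.1, beta=1.0, gamma=0.5)
    Q0 = [[0.0, 0.0], [0.0, 0.0]]
    det = trajectory(deterministic_step, Q0, 10, **params)
    for K in (10, 1_000, 50_000):
        smp = trajectory(sample_batch_step, Q0, 10, batch_size=K, rng=random.Random(0), **params)
        gap = max(max_gap(d, s) for d, s in zip(det, smp))
        print(f"batch_size={K:>6}: max deviation from deterministic trajectory = {gap:.4f}")
```

Running the script prints, for each batch size, the largest deviation between the sampled trajectory and the deterministic one; in this toy setup the deviation should shrink as `batch_size` grows. Updating every action from its own batch average, rather than only the action just taken, is what lets the sampled update converge to the expected update in this sketch.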