Institute of Digital Technologies, Loughborough University London, 3 Lesney Avenue, E20 3BS, London, United Kingdom.
Neural Netw. 2020 Aug;128:97-106. doi: 10.1016/j.neunet.2020.04.023. Epub 2020 May 4.
Humans live among other humans, not in isolation. Therefore, the ability to learn and behave in multi-agent environments is essential for any autonomous system that intends to interact with people. Due to the presence of multiple simultaneous learners in a multi-agent learning environment, the Markov assumption used for single-agent environments is not tenable, necessitating the development of new policy learning algorithms. Recent Actor-Critic algorithms proposed for multi-agent environments, such as Multi-Agent Deep Deterministic Policy Gradients and Counterfactual Multi-Agent Policy Gradients, retain the mathematical framework of single-agent environments by augmenting the Critic with extra information. However, this extra information can slow down the learning process and afflict the Critic with the curse of dimensionality. To combat this, we propose a novel deep neural network configuration called the Deep Multi-Critic Network. This architecture works by taking a weighted sum over the outputs of multiple critic networks of varying complexity and size. The configuration was tested on data collected from a real-world multi-agent environment. The results illustrate that with the Deep Multi-Critic Network, less data is needed to reach the same level of performance as without it. This suggests that, because the configuration learns from less data, the Critic may be able to learn Q-values faster, accelerating Actor training as well.
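To make the weighted-sum idea concrete, the sketch below gives one plausible reading of the Deep Multi-Critic Network in PyTorch. The abstract specifies only that the outputs of several critic networks of varying complexity and size are combined by a weighted sum; the learnable softmax-normalised weights, the head sizes, and all input dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CriticHead(nn.Module):
    """One critic of a given depth/width; maps (state, joint action) -> Q-value."""
    def __init__(self, input_dim, hidden_dims):
        super().__init__()
        layers, dim = [], input_dim
        for h in hidden_dims:
            layers += [nn.Linear(dim, h), nn.ReLU()]
            dim = h
        layers.append(nn.Linear(dim, 1))  # scalar Q-value
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class DeepMultiCriticNetwork(nn.Module):
    """Weighted sum over the outputs of several critics of varying complexity."""
    def __init__(self, input_dim, critic_hidden_dims):
        super().__init__()
        self.critics = nn.ModuleList(
            CriticHead(input_dim, dims) for dims in critic_hidden_dims
        )
        # Learnable mixing weights, one per critic head (an assumption; the
        # abstract states only that a weighted sum is taken over the outputs).
        self.weights = nn.Parameter(torch.ones(len(self.critics)))

    def forward(self, state, joint_action):
        # Centralised critic input: concatenate state with the joint action.
        x = torch.cat([state, joint_action], dim=-1)
        qs = torch.cat([c(x) for c in self.critics], dim=-1)  # (batch, n_critics)
        w = torch.softmax(self.weights, dim=0)                # normalised weights
        return (qs * w).sum(dim=-1, keepdim=True)             # (batch, 1)

# Usage: three critic heads of increasing size over a 16-dim state and a
# 4-dim joint action (all dimensions are illustrative).
dmcn = DeepMultiCriticNetwork(16 + 4, [[32], [64, 64], [128, 128, 128]])
q = dmcn(torch.randn(8, 16), torch.randn(8, 4))  # -> shape (8, 1)
```

One appeal of this design is that the small heads can fit coarse Q-value structure quickly while the larger heads refine it, with the mixing weights arbitrating between them; this is consistent with, though not proven by, the abstract's claim that the configuration reaches the same performance from less data.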