MACRPO: Multi-agent cooperative recurrent policy optimization.

Authors

Eshagh Kargar, Ville Kyrki

Affiliation

Intelligent Robotics Group, Electrical Engineering and Automation Department, Aalto University, Helsinki, Finland.

Publication

Front Robot AI. 2024 Dec 20;11:1394209. doi: 10.3389/frobt.2024.1394209. eCollection 2024.

Abstract

This work considers the problem of learning cooperative policies in multi-agent settings with partially observable and non-stationary environments and no communication channel. We focus on improving information sharing between agents and propose a new multi-agent actor-critic method called MACRPO. MACRPO integrates information across agents and time in two novel ways. First, we use a recurrent layer in the critic's network architecture and propose a new framework that trains this layer on a meta-trajectory combining all agents' experiences. This allows the network to learn the dynamics of interaction and cooperation between agents and to handle partial observability. Second, we propose a new advantage function that incorporates other agents' rewards and value functions, with a parameter controlling the level of cooperation between agents. This control parameter is suitable for environments in which the agents are unable to fully cooperate with each other. We evaluate our algorithm on three challenging multi-agent environments with continuous and discrete action spaces: Deepdrive-Zero, Multi-Walker, and the Particle environment. We compare the results with several ablations, with state-of-the-art multi-agent algorithms such as MAGIC, IC3Net, CommNet, GA-Comm, QMIX, MADDPG, and RMAPPO, and with single-agent methods that share parameters between agents, such as IMPALA and APEX. The results show superior performance over the other algorithms. The code is available online at https://github.com/kargarisaac/macrpo.
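The abstract describes the two mechanisms only at a high level, so the sketches below illustrate them under stated assumptions. Both are in Python with NumPy; the function names (`build_meta_trajectory`, `cooperative_gae`) and the exact mixing rule are hypothetical and not taken from the paper or the linked repository.

```python
# Minimal sketch of the meta-trajectory idea: interleave all agents'
# transitions at each timestep into a single sequence, so one recurrent
# critic sees the whole interaction in time order. The interleaving
# order used in the actual paper may differ.
def build_meta_trajectory(episode, n_agents):
    # episode: list over timesteps; episode[t][i] is agent i's transition at t.
    return [episode[t][i] for t in range(len(episode)) for i in range(n_agents)]
```

The second sketch shows one plausible form of a cooperation-weighted advantage: each agent's reward and value estimate are blended with the mean over the other agents before running standard generalized advantage estimation (GAE). The blend weight `coop` plays the role of the paper's cooperation-level parameter; its exact placement inside MACRPO's advantage function may differ.

```python
import numpy as np

def cooperative_gae(rewards, values, gamma=0.99, lam=0.95, coop=0.5):
    """Cooperation-weighted GAE sketch.

    rewards, values: arrays of shape (T, N) for T timesteps and N agents.
    coop: assumed cooperation parameter in [0, 1]; 0 = fully selfish,
    1 = fully shared learning signal.
    """
    T, N = rewards.shape
    # Mean reward/value of the *other* agents at each timestep.
    others_r = (rewards.sum(axis=1, keepdims=True) - rewards) / (N - 1)
    others_v = (values.sum(axis=1, keepdims=True) - values) / (N - 1)
    mixed_r = (1 - coop) * rewards + coop * others_r
    mixed_v = (1 - coop) * values + coop * others_v

    # Standard GAE over the mixed signals, computed per agent;
    # the final step is treated as terminal for simplicity.
    adv = np.zeros_like(mixed_r)
    running = np.zeros(N)
    for t in reversed(range(T - 1)):
        delta = mixed_r[t] + gamma * mixed_v[t + 1] - mixed_v[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With `coop=0` this reduces to independent per-agent GAE; with `coop=1` each agent optimizes the other agents' average return, consistent with the abstract's claim that the parameter tunes the level of cooperation.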

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6704/11695781/8944cc3b18ec/frobt-11-1394209-g001.jpg
