Zhang Tianle, Liu Zhen, Yi Jianqiang, Wu Shiguang, Pu Zhiqiang, Zhao Yanjie
IEEE Trans Neural Netw Learn Syst. 2024 Sep;35(9):12678-12692. doi: 10.1109/TNNLS.2023.3264275. Epub 2024 Sep 3.
Recently, multiagent reinforcement learning (MARL) has shown great potential for learning cooperative policies in multiagent systems (MASs). However, a notable drawback of current MARL is its low sample efficiency, which demands a huge number of interactions with the environment and greatly hinders the real-world application of MARL. Fortunately, effectively incorporating experience knowledge can help MARL find effective solutions quickly, which significantly alleviates this drawback. In this article, a novel multiexperience-assisted reinforcement learning (MEARL) method is proposed to improve the learning efficiency of MASs. Specifically, monotonicity-constrained reward shaping is innovatively designed using expert experience to provide additional individual rewards that guide multiagent learning efficiently, while guaranteeing the invariance of the team optimization objective. Furthermore, a reward distribution estimator is specially developed to model the implicit reward distribution of the environment using transition experience, i.e., collected samples of state-action pairs, rewards, and next states. This estimator predicts the expected reward of each agent for the action taken, which enables accurate estimation of the state value function and accelerates its convergence. The performance of MEARL is evaluated on two multiagent environment platforms: our designed unmanned aerial vehicle combat (UAV-C) and StarCraft II Micromanagement (SCII-M). Simulation results demonstrate that the proposed MEARL greatly improves the learning efficiency and performance of MASs and outperforms state-of-the-art methods on multiagent tasks.