Zhu Yuanheng, Li Weifan, Zhao Mengchen, Hao Jianye, Zhao Dongbin
IEEE Trans Cybern. 2023 Oct;53(10):6443-6455. doi: 10.1109/TCYB.2022.3179775. Epub 2023 Sep 15.
In single-agent Markov decision processes, an agent can optimize its policy through interaction with the environment. In multiplayer Markov games (MGs), however, the interaction is nonstationary due to the behaviors of the other players, so an agent has no fixed optimization objective; the challenge becomes finding equilibrium policies for all players. In this research, we treat the evolution of player policies as a dynamical process and propose a novel learning scheme for Nash equilibrium. Its core is to evolve each player's policy according not just to its current in-game performance, but to an aggregation of its performance over history. We show that for a variety of MGs, players following our learning scheme provably converge to a point that approximates a Nash equilibrium. Combined with neural networks, we develop an empirical policy optimization algorithm, which is implemented in a reinforcement-learning framework and runs in a distributed way, with each player optimizing its policy based on its own observations. We use two numerical examples to validate the convergence property on small-scale MGs, and a Pong example to demonstrate the potential on large games.
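For intuition, here is a minimal sketch of the history-aggregation idea on a two-player zero-sum matrix game (matching pennies). The smoothed-fictitious-play-style update, payoff matrix, softmax temperature, and step size below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

# Matching pennies: row player's payoff matrix; the column player receives the negative.
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

def softmax(q, temp):
    # Smoothed best response to action values q (illustrative smoothing choice).
    z = (q - q.max()) / temp
    e = np.exp(z)
    return e / e.sum()

x = np.ones(2) / 2   # row player's mixed policy
y = np.ones(2) / 2   # column player's mixed policy
qx = np.zeros(2)     # aggregated historical action values, row player
qy = np.zeros(2)     # aggregated historical action values, column player

for t in range(1, 20001):
    # Current in-game performance: expected payoff of each pure action.
    ux = A @ y
    uy = -A.T @ x
    # Core idea of the scheme: aggregate performance over history
    # (a running average here) rather than reacting only to the present.
    qx += (ux - qx) / t
    qy += (uy - qy) / t
    # Evolve each policy toward a smoothed best response to its historical aggregate.
    x += 0.1 * (softmax(qx, 0.05) - x)
    y += 0.1 * (softmax(qy, 0.05) - y)

print("row policy:", np.round(x, 3))  # approaches [0.5, 0.5], the Nash equilibrium
print("col policy:", np.round(y, 3))
```

Reacting only to ux would make each player chase the other's latest policy and cycle forever; averaging over history damps those oscillations, which is what lets both policies settle near the equilibrium.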