Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games

Authors

Zhu Yuanheng, Zhao Dongbin

Publication

IEEE Trans Neural Netw Learn Syst. 2022 Mar;33(3):1228-1241. doi: 10.1109/TNNLS.2020.3041469. Epub 2022 Feb 28.

DOI: 10.1109/TNNLS.2020.3041469
PMID: 33306474
Abstract

The Nash equilibrium is an important concept in game theory. It describes the least exploitability of one player by any opponent. We combine game theory, dynamic programming, and recent deep reinforcement learning (DRL) techniques to learn online the Nash equilibrium policy for two-player zero-sum Markov games (TZMGs). The problem is first formulated as a Bellman minimax equation, and generalized policy iteration (GPI) provides a double-loop iterative way to find the equilibrium. Then, neural networks are introduced to approximate Q functions for large-scale problems. An online minimax Q network learning algorithm is proposed to train the network with observations. Experience replay, dueling network, and double Q-learning are applied to improve the learning process. The contributions are twofold: 1) DRL techniques are combined with GPI to find the TZMG Nash equilibrium for the first time and 2) the convergence of the online learning algorithm with a lookup table and experience replay is proven, whose proof is not only useful for TZMGs but also instructive for single-agent Markov decision problems. Experiments on different examples validate the effectiveness of the proposed algorithm on TZMG problems.
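The Bellman minimax backup described in the abstract can be sketched in tabular form. The sketch below is an illustration, not the paper's implementation: it approximates the stage-game value with a pure-strategy maximin (the paper computes the exact mixed-strategy minimax value, typically by solving a linear program at each state), and the single-state matching-pennies environment, state label, and hyperparameters are invented for the demo.

```python
import itertools
from collections import defaultdict

def maximin_value(Q, s, actions, opp_actions):
    """Pure-strategy maximin approximation of the matrix-game value at state s.
    (A mixed-strategy LP solve would give the true minimax value.)"""
    return max(min(Q[(s, a, o)] for o in opp_actions) for a in actions)

def minimax_q_update(Q, s, a, o, r, s_next, actions, opp_actions,
                     alpha=0.1, gamma=0.9):
    """One Bellman minimax backup:
    Q(s,a,o) <- Q(s,a,o) + alpha * [r + gamma * V(s') - Q(s,a,o)]."""
    target = r + gamma * maximin_value(Q, s_next, actions, opp_actions)
    Q[(s, a, o)] += alpha * (target - Q[(s, a, o)])

# Toy single-state zero-sum game: the maximizing player earns +1 when the
# joint actions match and -1 otherwise (a repeated matching-pennies variant).
ACTIONS = (0, 1)
Q = defaultdict(float)
for _ in range(100):  # deterministic sweeps over all joint actions
    for a, o in itertools.product(ACTIONS, ACTIONS):
        r = 1.0 if a == o else -1.0
        minimax_q_update(Q, "s", a, o, r, "s", ACTIONS, ACTIONS)
```

Under the pure-strategy restriction every row of this payoff matrix has a worst case of -1, so the maximin stage value is -1 and the Q values drift toward r + gamma * V with V = -1/(1 - gamma); the mixed-strategy equilibrium value of matching pennies is 0, which is exactly why the full mixed-strategy solve in the paper matters.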

Similar Articles

1
Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games.
IEEE Trans Neural Netw Learn Syst. 2022 Mar;33(3):1228-1241. doi: 10.1109/TNNLS.2020.3041469. Epub 2022 Feb 28.
2
Solving the Zero-Sum Control Problem for Tidal Turbine System: An Online Reinforcement Learning Approach.
IEEE Trans Cybern. 2023 Dec;53(12):7635-7647. doi: 10.1109/TCYB.2022.3186886. Epub 2023 Nov 29.
3
Data-Based Reinforcement Learning for Nonzero-Sum Games With Unknown Drift Dynamics.
IEEE Trans Cybern. 2019 Aug;49(8):2874-2885. doi: 10.1109/TCYB.2018.2830820. Epub 2018 May 16.
4
Empirical Policy Optimization for n-Player Markov Games.
IEEE Trans Cybern. 2023 Oct;53(10):6443-6455. doi: 10.1109/TCYB.2022.3179775. Epub 2023 Sep 15.
5
Experience Replay for Optimal Control of Nonzero-Sum Game Systems With Unknown Dynamics.
IEEE Trans Cybern. 2016 Mar;46(3):854-65. doi: 10.1109/TCYB.2015.2488680. Epub 2015 Oct 26.
6
Online Solution of Two-Player Zero-Sum Games for Continuous-Time Nonlinear Systems With Completely Unknown Dynamics.
IEEE Trans Neural Netw Learn Syst. 2016 Dec;27(12):2577-2587. doi: 10.1109/TNNLS.2015.2496299. Epub 2015 Nov 20.
7
Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate.
Neural Netw. 2024 Jul;175:106274. doi: 10.1016/j.neunet.2024.106274. Epub 2024 Mar 27.
8
Deep Reinforcement Learning for Nash Equilibrium of Differential Games.
IEEE Trans Neural Netw Learn Syst. 2025 Feb;36(2):2747-2761. doi: 10.1109/TNNLS.2024.3351631. Epub 2025 Feb 6.
9
On the complexity of computing Markov perfect equilibrium in general-sum stochastic games.
Natl Sci Rev. 2022 Nov 22;10(1):nwac256. doi: 10.1093/nsr/nwac256. eCollection 2023 Jan.
10
Multiagent Adversarial Collaborative Learning via Mean-Field Theory.
IEEE Trans Cybern. 2021 Oct;51(10):4994-5007. doi: 10.1109/TCYB.2020.3025491. Epub 2021 Oct 12.

Cited By

1
Network Dismantling on Signed Network by Evolutionary Deep Reinforcement Learning.
Sensors (Basel). 2024 Dec 16;24(24):8026. doi: 10.3390/s24248026.
2
Adversarial Decision-Making for Moving Target Defense: A Multi-Agent Markov Game and Reinforcement Learning Approach.
Entropy (Basel). 2023 Apr 2;25(4):605. doi: 10.3390/e25040605.