
Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games

Author Information

Zhu Yuanheng, Zhao Dongbin

Publication Information

IEEE Trans Neural Netw Learn Syst. 2022 Mar;33(3):1228-1241. doi: 10.1109/TNNLS.2020.3041469. Epub 2022 Feb 28.

Abstract

The Nash equilibrium is an important concept in game theory. It describes a strategy profile under which a player is the least exploitable by any opponent. We combine game theory, dynamic programming, and recent deep reinforcement learning (DRL) techniques to learn the Nash equilibrium policy of two-player zero-sum Markov games (TZMGs) online. The problem is first formulated as a Bellman minimax equation, and generalized policy iteration (GPI) provides a double-loop iterative procedure for finding the equilibrium. Neural networks are then introduced to approximate the Q function for large-scale problems. An online minimax Q network learning algorithm is proposed to train the network from observations. Experience replay, a dueling network, and double Q-learning are applied to improve the learning process. The contributions are twofold: 1) DRL techniques are combined with GPI to find the TZMG Nash equilibrium for the first time, and 2) the convergence of the online learning algorithm with a lookup table and experience replay is proven; the proof is useful not only for TZMGs but also instructive for single-agent Markov decision problems. Experiments on different examples validate the effectiveness of the proposed algorithm on TZMG problems.
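For reference, the Bellman minimax equation mentioned in the abstract can be written in the standard minimax-Q form (the notation below is ours, following Littman's formulation, and may differ from the paper's symbols):

Q^*(s, a, b) = r(s, a, b) + \gamma \sum_{s'} p(s' \mid s, a, b) \max_{\pi \in \Delta(\mathcal{A})} \min_{b' \in \mathcal{B}} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, Q^*(s', a', b')

Here s is the state, a and b are the two players' actions, and γ is the discount factor; the protagonist maximizes over mixed strategies π while the opponent minimizes over its actions, so each backup solves a zero-sum matrix game at the next state.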

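For concreteness, below is a minimal sketch of the tabular (lookup-table) variant with experience replay, the setting in which the convergence result is stated. All names here (matrix_game_value, minimax_q_update, the replay-buffer layout) are illustrative assumptions, not the authors' implementation; the inner matrix game is solved with a standard linear program via scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value and maximizer's mixed strategy of the zero-sum matrix game
    max_pi min_b sum_a pi(a) * M[a, b], solved as a linear program."""
    n_a, n_b = M.shape
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                  # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((n_b, 1))])   # v - pi^T M[:, b] <= 0 for every b
    b_ub = np.zeros(n_b)
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)  # pi sums to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

def minimax_q_update(Q, replay, alpha=0.1, gamma=0.95, batch_size=32,
                     rng=np.random.default_rng()):
    """One minimax-Q learning step: sample transitions from the replay buffer
    and back up Q(s, a, b) toward r + gamma * (value of the next state's
    matrix game). Q maps each state to an (n_a, n_b) payoff table."""
    for _ in range(batch_size):
        s, a, b, r, s_next, done = replay[rng.integers(len(replay))]
        v_next = 0.0 if done else matrix_game_value(Q[s_next])[0]
        Q[s][a, b] += alpha * (r + gamma * v_next - Q[s][a, b])
```

In the network version described in the abstract, the table Q[s] is replaced by a Q network evaluated at s, with the dueling architecture and double Q-learning applied to the same backup.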
