Luo Biao, Yang Yin, Liu Derong
IEEE Trans Cybern. 2021 Jul;51(7):3630-3640. doi: 10.1109/TCYB.2020.2970969. Epub 2021 Jun 23.
In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. In theory, this problem reduces to solving the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the Q-function is introduced and a data-based policy iteration Q-learning (PIQL) algorithm is developed to learn the optimal Q-function from data collected from the real system. By writing the Q-function in quadratic form, the PIQL algorithm is proved equivalent to the Newton iteration method in a Banach space by means of the Fréchet derivative. The convergence of the PIQL algorithm then follows from Kantorovich's theorem. To implement the PIQL algorithm, an off-policy learning scheme is proposed that uses real data rather than a system model. Finally, the efficiency of the developed data-based PIQL method is validated through simulation studies.
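The policy-iteration structure described above can be sketched numerically. The following is a minimal, illustrative Python sketch of policy iteration on the quadratic Q-function kernel for a zero-sum linear-quadratic game: each step evaluates the current control/disturbance policies, forms the quadratic kernel H, and improves both policies from the blocks of H. For clarity, policy evaluation here uses a known model (A, B, D), whereas the paper's off-policy PIQL scheme estimates H from measured data alone; the system matrices and weights below are made-up examples, not taken from the paper.

```python
import numpy as np

# Illustrative zero-sum LQ game (all matrices are assumed examples).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])   # state dynamics (stable, so zero gains are admissible)
B = np.array([[0.0],
              [1.0]])        # control input matrix
D = np.array([[0.1],
              [0.0]])        # disturbance input matrix
Qc = np.eye(2)               # state weight
R = np.eye(1)                # control weight
g2 = 25.0                    # gamma^2, disturbance attenuation level
n, m, q = 2, 1, 1

def policy_value(K, L):
    """Value matrix P of the policies u = -Kx, w = -Lx.

    Solves the discrete Lyapunov equation P = Acl' P Acl + c by
    vectorization (valid here since P and c are symmetric).
    """
    Acl = A - B @ K - D @ L
    c = Qc + K.T @ R @ K - g2 * (L.T @ L)
    vecP = np.linalg.solve(np.eye(n * n) - np.kron(Acl.T, Acl.T), c.ravel())
    return vecP.reshape(n, n)

K = np.zeros((m, n))         # initial admissible control policy
L = np.zeros((q, n))         # initial disturbance policy
for _ in range(50):
    P = policy_value(K, L)
    # Quadratic Q-function kernel: Q(x, u, w) = z' H z with z = [x; u; w].
    G = np.hstack([A, B, D])
    W = np.block([[Qc, np.zeros((n, m)), np.zeros((n, q))],
                  [np.zeros((m, n)), R, np.zeros((m, q))],
                  [np.zeros((q, n)), np.zeros((q, m)), -g2 * np.eye(q)]])
    H = W + G.T @ P @ G
    Hxu = H[:n, n:n + m]; Hxw = H[:n, n + m:]
    Huu = H[n:n + m, n:n + m]; Huw = H[n:n + m, n + m:]; Hww = H[n + m:, n + m:]
    # Policy improvement: saddle-point gains from the blocks of H.
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Huw.T),
                        Hxu.T - Huw @ np.linalg.solve(Hww, Hxw.T))
    L = np.linalg.solve(Hww - Huw.T @ np.linalg.solve(Huu, Huw),
                        Hxw.T - Huw.T @ np.linalg.solve(Huu, Hxu.T))

rho = max(abs(np.linalg.eigvals(A - B @ K - D @ L)))
print("closed-loop spectral radius:", rho)
```

In a data-based realization, the least-squares estimate of H from state/input samples would replace the model-based evaluation step, which is the role of the off-policy scheme in the paper.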