IEEE Trans Cybern. 2022 Dec;52(12):13083-13095. doi: 10.1109/TCYB.2021.3100749. Epub 2022 Nov 18.
This article proposes robust inverse Q-learning algorithms for a learner to mimic an expert's states and control inputs in the imitation learning problem. The two agents are subject to different adversarial disturbances. To perform the imitation, the learner must reconstruct the unknown expert cost function. The learner observes only the expert's control inputs and uses inverse Q-learning algorithms to reconstruct the unknown expert cost function. The inverse Q-learning algorithms are robust in that they are independent of the system model and allow for different cost-function parameters and disturbances between the two agents. We first propose an offline inverse Q-learning algorithm that consists of two iterative learning loops: 1) an inner Q-learning iteration loop and 2) an outer iteration loop based on inverse optimal control. Then, based on this offline algorithm, we further develop an online inverse Q-learning algorithm such that the learner mimics the expert behaviors online from real-time observation of the expert control inputs. This online computational method uses four function approximators: a critic approximator, two actor approximators, and a state-reward neural network (NN). It simultaneously approximates the parameters of the Q-function and the learner state reward online. Convergence and stability proofs are rigorously provided to guarantee the algorithm performance.
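The following Python sketch is not the paper's algorithm (which targets continuous-state systems with adversarial disturbances and uses a critic, two actors, and a state-reward NN); it only illustrates, under simplifying assumptions, the nested structure described in the abstract: an inner Q-learning loop that solves the forward problem for a fixed reward estimate, and an outer inverse-optimal-control-style loop that corrects the reward estimate from the expert's observed control inputs. The random MDP, the state-action reward table, and the perceptron-style update are all illustrative assumptions, not the authors' method.

```python
import numpy as np

# Toy tabular setting (assumption): a small random MDP stands in for the
# continuous-state, disturbance-affected systems treated in the paper.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition kernel P[s, a, s'] (hypothetical dynamics for the sketch).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def inner_q_loop(reward, n_iters=200):
    """Inner loop: greedy Q-function for a fixed reward estimate.
    (Model-based value iteration stands in for the paper's data-driven Q-learning.)"""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q = reward + gamma * P @ Q.max(axis=1)
    return Q

# The learner observes only the expert's control inputs (greedy actions here),
# generated from a true reward that remains unknown to the learner.
true_reward = rng.normal(size=(n_states, n_actions))
expert_actions = inner_q_loop(true_reward).argmax(axis=1)

# Outer loop: inverse-optimal-control-style correction of the reward estimate,
# nudging it until the learner's greedy actions reproduce the expert's.
reward_hat = np.zeros((n_states, n_actions))
lr = 0.5
for _ in range(100):
    Q_hat = inner_q_loop(reward_hat)
    learner_actions = Q_hat.argmax(axis=1)
    mismatch = learner_actions != expert_actions
    if not mismatch.any():
        break
    s_idx = np.flatnonzero(mismatch)
    # Raise the estimated reward of the expert's action and lower that of the
    # learner's competing action in every mismatched state (perceptron-style step).
    reward_hat[s_idx, expert_actions[s_idx]] += lr
    reward_hat[s_idx, learner_actions[s_idx]] -= lr

print("expert actions :", expert_actions)
print("learner actions:", inner_q_loop(reward_hat).argmax(axis=1))
```

In this sketch the reward is a state-action table for simplicity, whereas the paper estimates a state reward with a neural network and runs the corresponding updates online as expert inputs are observed.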