Wen Guoxing, Chen C L Philip
IEEE Trans Neural Netw Learn Syst. 2023 Mar;34(3):1524-1536. doi: 10.1109/TNNLS.2021.3105548. Epub 2023 Feb 28.
In this article, an optimized leader-following consensus control scheme is proposed for nonlinear strict-feedback multi-agent systems by building on the idea of the optimized backstepping technique, which designs the virtual and actual backstepping controls to be the optimized solutions of the corresponding subsystems so that the entire backstepping control is optimized. Since this control must not only optimize the system performance but also synchronize multiple system state variables, it is an interesting and challenging topic. To achieve this optimized control, neural network approximation-based reinforcement learning (RL) is performed under a critic-actor architecture. In most existing RL-based optimal controls, both the critic and actor RL updating laws are derived from the negative gradient of the square of the Hamilton-Jacobi-Bellman (HJB) equation's approximation, which contains multiple nonlinear terms, so their algorithms are inevitably intricate. In contrast, the proposed optimized control derives the RL updating laws from the negative gradient of a simple positive function that is correlated with the HJB equation; hence, the algorithm can be significantly simpler. Meanwhile, it also relaxes two conditions commonly required in RL-based optimal controls: known dynamics and persistent excitation. Therefore, the proposed optimized scheme can be a natural choice for high-order nonlinear multi-agent control. Finally, its effectiveness is demonstrated by both theory and simulation.
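To make the contrast in the abstract concrete, the following is a minimal toy sketch (our own construction, not the paper's multi-agent system): it runs the *conventional* RL-style update the abstract criticizes, i.e., gradient descent on the square of the HJB residual, for a scalar linear-quadratic problem dx/dt = a·x + u with cost integrand q·x² + r·u². Here the critic V(x) = w·x² has a known optimal weight from the Riccati equation, so convergence can be checked; the gradient of the squared residual visibly stacks several nonlinear terms in w, which is the algorithmic complexity the paper's simpler positive-function gradient is said to avoid.

```python
import math

# Toy scalar problem (hypothetical example): dx/dt = a*x + u, cost q*x^2 + r*u^2.
a, q, r = -1.0, 1.0, 1.0

def hjb_residual(w, x):
    """HJB residual at state x for critic V(x) = w*x^2.

    The actor implied by the critic is u = -(1/(2r)) * dV/dx = -(w/r)*x.
    """
    u = -(w / r) * x
    return q * x**2 + r * u**2 + 2.0 * w * x * (a * x + u)

# Conventional scheme: gradient descent on E(w) = 0.5 * residual(w, x)^2
# at a sample state x = 1. The gradient residual * d(residual)/dw multiplies
# nonlinear terms in w together -- the intricacy the abstract refers to.
w, lr, x = 0.0, 0.05, 1.0
for _ in range(400):
    res = hjb_residual(w, x)
    dres_dw = (2.0 * a - 2.0 * w / r) * x**2   # d(residual)/dw
    w -= lr * res * dres_dw                     # negative gradient of E(w)

# Riccati solution of the same problem gives the true optimal critic weight.
w_star = r * (a + math.sqrt(a**2 + q / r))
print(round(w, 4), round(w_star, 4))  # both ~ 0.4142
```

The paper's scheme replaces this squared-residual objective with a simpler positive function correlated with the HJB equation, so its updating laws are linear-in-structure by comparison; the sketch above only reproduces the conventional baseline for reference.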