Zhao Mingming, Wang Ding, Song Shijie, Qiao Junfei
IEEE Trans Cybern. 2025 Jul;55(7):3511-3524. doi: 10.1109/TCYB.2025.3562172.
In this article, an accelerated value iteration-based safe Q-learning (SQL) algorithm is developed to design the tracking controller for unknown nonlinear systems. First, an augmented Q-function, consisting of a quadratic utility function and an adjustable positive-definite control barrier function (CBF), is devised to ensure both the optimality and the safety of the tracking controller. The quadratic utility function, associated with optimality, guarantees that the tracking controller can eliminate the ultimate tracking error regardless of the reference trajectory. The adjustable positive-definite CBF, pertaining to safety, ensures that the tracking error converges faster toward zero while remaining within the safe set at all times. Second, an accelerated iterative learning mechanism, comprising policy evaluation (PE) and policy improvement (PI), is employed to find the safe optimal tracking control policy. Integrating the difference between two iterative Q-functions into the current PE process speeds up the convergence of the SQL algorithm. A policy optimization technique based on the Nesterov momentum method is used to accelerate the PI process of the SQL algorithm. When a large amount of offline data must be processed, this two-stage accelerated learning effectively reduces the computational burden. Furthermore, the convergence of the Q-function sequence and the safety of the optimal tracking policy are theoretically analyzed. Finally, using neural networks and the actor-critic structure, two simulation examples are carried out to verify the effectiveness of the accelerated SQL methods.
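The abstract does not state the update rules, so the following is only a minimal sketch of how a Nesterov momentum step could accelerate a policy improvement stage that minimizes a Q-function over the control input. The toy quadratic Q-function, the matrices `A`, `B`, `P`, `R`, the tracking error `e`, and the function names are all hypothetical, chosen so that the result can be checked against a closed-form minimizer. The CBF term and the neural network approximators of the paper are omitted.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): for a fixed tracking error e,
# Q(e, u) = u^T R u + (A e + B u)^T P (A e + B u), which is quadratic in u.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
P = np.eye(2)
R = np.array([[1.0]])
e = np.array([1.0, 0.5])


def grad_q(u):
    """Gradient of the toy Q-function with respect to the control input u."""
    return 2.0 * R @ u + 2.0 * B.T @ P @ (A @ e + B @ u)


def nesterov_minimize(grad, u0, lr=0.1, momentum=0.9, steps=200):
    """Minimize a function of u with Nesterov momentum (look-ahead gradient)."""
    u = np.asarray(u0, dtype=float)
    v = np.zeros_like(u)
    for _ in range(steps):
        lookahead = u + momentum * v          # evaluate the gradient ahead of u
        v = momentum * v - lr * grad(lookahead)
        u = u + v
    return u


# Closed-form minimizer of the quadratic, used only as a sanity check.
closed_form = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A @ e)
u_improved = nesterov_minimize(grad_q, np.zeros(1))
```

In this sketch the look-ahead point `u + momentum * v` is what distinguishes Nesterov momentum from classical momentum; for the toy quadratic, `u_improved` can be compared with `closed_form` to confirm that the iteration has converged.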