Lin Ziyu, Duan Jingliang, Li Shengbo Eben, Ma Haitong, Li Jie, Chen Jianyu, Cheng Bo, Ma Jun
IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5255-5267. doi: 10.1109/TNNLS.2022.3225090. Epub 2023 Sep 1.
The Hamilton-Jacobi-Bellman (HJB) equation serves as the necessary and sufficient condition for the optimal solution to the continuous-time (CT) optimal control problem (OCP). Compared with the infinite-horizon HJB equation, solving the finite-horizon (FH) HJB equation has been a long-standing challenge, because the partial time derivative of the value function appears as an additional unknown term. To address this problem, this study establishes, for the first time, a link between the partial time derivative and the terminal-time utility function, which facilitates the use of the policy iteration (PI) technique to solve CT FH OCPs. Based on this key finding, an FH approximate dynamic programming (ADP) algorithm is proposed, leveraging an actor-critic framework. It is shown that the algorithm exhibits important properties in terms of convergence and optimality. Importantly, with the use of multilayer neural networks (NNs) in the actor-critic architecture, the algorithm is suitable for CT FH OCPs for more general nonlinear and complex systems. Finally, the effectiveness of the proposed algorithm is demonstrated by conducting a series of simulations on both a linear quadratic regulator (LQR) problem and a nonlinear vehicle tracking problem.
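For the LQR benchmark mentioned above, the FH HJB equation admits a closed-form structure: the value function is quadratic, V(x, t) = xᵀP(t)x, and its partial time derivative is governed by the differential Riccati equation integrated backward from the terminal cost. The sketch below is not the paper's NN-based ADP algorithm; it is a minimal baseline showing how the terminal-time boundary condition fixes the time-varying value function in the CT FH LQR case. All system matrices are illustrative placeholders.

```python
import numpy as np

# Illustrative CT linear system dx/dt = A x + B u with quadratic cost
# integral of (x'Qx + u'Ru) dt over [0, T], plus terminal cost x'Qf x.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # running state cost
R = np.array([[1.0]])  # running control cost
Qf = np.eye(2)         # terminal cost: boundary condition P(T) = Qf
T, dt = 5.0, 1e-3

def riccati_rhs(P):
    """Right-hand side of -dP/dt in the differential Riccati equation:
    -dP/dt = A'P + PA - P B R^{-1} B' P + Q."""
    return A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q

# Integrate backward in time from t = T to t = 0 with explicit Euler.
P = Qf.copy()
for _ in range(int(T / dt)):
    P = P + dt * riccati_rhs(P)

# Optimal time-varying feedback at t = 0: u = -K x with K = R^{-1} B' P(0).
K = np.linalg.solve(R, B.T @ P)
print("P(0) =\n", P)
print("K(0) =", K)
```

Because the quadratic value function makes the partial time derivative explicit (dV/dt = xᵀṖ(t)x), this baseline is a useful sanity check for any approximate FH solver on linear systems.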