Sun Changyin, Li Xiaofeng, Sun Yuewen
IEEE Trans Neural Netw Learn Syst. 2021 Aug;32(8):3578-3587. doi: 10.1109/TNNLS.2020.3015767. Epub 2021 Aug 3.
In this article, a model-free online adaptive dynamic programming (ADP) approach is developed for solving the optimal control problem of nonaffine nonlinear systems. The off-policy learning mechanism is combined with a parallel paradigm: multithread agents collect transitions by interacting with the environment, which significantly increases the amount of sampled data. In addition, each thread agent explores the environment from a different initial state under its own behavior policy, which enhances exploration and reduces the correlation between the sampled data. After the policy evaluation process, policy improvement requires only a one-step update based on the policy gradient method. The stability of the system under the iterative control laws is guaranteed. Moreover, a convergence analysis proves that the iterative Q-function is monotonically nonincreasing and converges to the solution of the Hamilton-Jacobi-Bellman (HJB) equation. To implement the algorithm, an actor-critic (AC) structure is used, with two neural networks (NNs) approximating the Q-function and the control policy. Finally, the effectiveness of the proposed algorithm is verified by two numerical examples.
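The loop described in the abstract (parallel off-policy data collection, policy evaluation, then a single policy-gradient step) can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes a toy scalar linear plant in place of a nonaffine nonlinear system, a quadratic Q-function fitted by least squares in place of the critic NN, a linear gain `k` in place of the actor NN, and a gradient normalized by the mean squared state so that the step size does not depend on the data scale. The plant parameters `A` and `B` are used only inside the simulator, so the learner itself remains model-free.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

A, B = 0.9, 0.5  # toy plant, hidden from the learner (assumption)

def step(x, u):
    """Simulator: returns the stage cost and the next state."""
    return x**2 + u**2, A * x + B * u

def features(x, u):
    """Quadratic features so that Q(x, u) = features(x, u) @ h."""
    return np.array([x * x, 2.0 * x * u, u * u])

def collect(seed, k, noise, horizon=20):
    """One thread agent: own initial state, own behavior policy."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0)
    data = []
    for _ in range(horizon):
        u = -k * x + noise * rng.standard_normal()  # behavior policy
        cost, x_next = step(x, u)
        data.append((x, u, cost, x_next))
        x = x_next
    return data

def evaluate(data, k):
    """Policy evaluation: least-squares fit of Q(x,u) - Q(x', -k x') = cost."""
    rows = [features(x, u) - features(xn, -k * xn) for x, u, _, xn in data]
    costs = [c for _, _, c, _ in data]
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
    return h  # [H_xx, H_xu, H_uu]

def improve(data, k, h, lr=0.1):
    """Policy improvement: one gradient step on the mean of Q(x, -k x)."""
    xs = np.array([x for x, _, _, _ in data])
    dq_du = 2.0 * h[1] * xs + 2.0 * h[2] * (-k * xs)  # dQ/du at u = -k x
    grad_k = np.mean(dq_du * (-xs)) / np.mean(xs**2)  # chain rule, normalized
    return k - lr * grad_k

def train(iterations=40, n_threads=4):
    k = 0.0  # initial admissible policy (the toy plant is open-loop stable)
    noises = np.linspace(0.2, 0.8, n_threads)  # one noise level per thread
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for it in range(iterations):
            seeds = [it * n_threads + i for i in range(n_threads)]
            batches = pool.map(collect, seeds, [k] * n_threads, noises)
            data = [t for batch in batches for t in batch]
            h = evaluate(data, k)
            k = improve(data, k, h)
    return k
```

Calling `train()` returns the learned feedback gain. For this toy plant it can be compared against the gain obtained from the discrete-time Riccati equation, which serves as a sanity check on the sketch only, not on the results reported in the article.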