Xu Zhenhui, Shen Tielong, Cheng Daizhan
IEEE Trans Neural Netw Learn Syst. 2022 Apr;33(4):1520-1534. doi: 10.1109/TNNLS.2020.3042589. Epub 2022 Apr 4.
In this article, a novel integral reinforcement learning (IRL) algorithm is proposed to solve the optimal control problem for continuous-time nonlinear systems with unknown dynamics. The main challenge in learning is how to reject the oscillation caused by the externally added probing noise. This article addresses the issue by embedding an auxiliary trajectory, designed as an exciting signal, to learn the optimal solution. First, the auxiliary trajectory is used to decompose the state trajectory of the controlled system. Then, by using the decoupled trajectories, a model-free policy iteration (PI) algorithm is developed, in which the policy evaluation step and the policy improvement step alternate until convergence to the optimal solution. Notably, an appropriate external input is introduced at the policy improvement step to eliminate the requirement of the input-to-state dynamics. Finally, the algorithm is implemented on an actor-critic structure. The output weights of the critic neural network (NN) and the actor NN are updated sequentially by least-squares methods. The convergence of the algorithm and the stability of the closed-loop system are guaranteed. Two examples are given to show the effectiveness of the proposed algorithm.
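To illustrate the IRL policy-iteration loop the abstract describes (policy evaluation by least squares on integral Bellman data, followed by policy improvement), the following is a minimal sketch on a scalar linear-quadratic problem where the optimal solution is known in closed form. All system parameters (`a`, `b`, `q`, `r`), the quadratic critic basis, and the use of the known input gain `b` in the improvement step are assumptions for this toy example; the paper's actual algorithm is model-free, handles nonlinear dynamics, and uses actor-critic NNs with an auxiliary trajectory, none of which is reproduced here.

```python
import numpy as np

# Hypothetical scalar plant x' = a*x + b*u with cost integral of q*x^2 + r*u^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

# Closed-form optimum of the scalar Riccati equation 2*a*p - (b**2/r)*p**2 + q = 0,
# used only to check convergence of the iteration below.
p_star = (r / b**2) * (a + np.sqrt(a**2 + b**2 * q / r))

def rollout(x0, K, T, dt=1e-4):
    """Euler-integrate the closed loop x' = (a - b*K)*x over [0, T],
    accumulating the running cost q*x^2 + r*u^2 along the way."""
    x, cost = x0, 0.0
    for _ in range(int(T / dt)):
        u = -K * x
        cost += (q * x**2 + r * u**2) * dt
        x += (a * x + b * u) * dt
    return x, cost

K = 3.0   # initial stabilizing gain (a - b*K < 0)
T = 0.05  # IRL integration interval
for _ in range(10):
    # Policy evaluation: with critic V(x) = p*x^2, the IRL Bellman equation gives
    # p*(x0^2 - xT^2) = integral of running cost; fit p by least squares over samples.
    regressors, targets = [], []
    for x0 in np.linspace(0.5, 2.0, 8):
        xT, c = rollout(x0, K, T)
        regressors.append(x0**2 - xT**2)
        targets.append(c)
    p = np.linalg.lstsq(np.array(regressors)[:, None],
                        np.array(targets), rcond=None)[0][0]
    # Policy improvement: u = -(b/r)*p*x (uses b for brevity here,
    # unlike the model-free improvement step of the paper).
    K = (b / r) * p
```

After a few iterations `p` and `K` approach the Riccati solution `p_star`, mirroring the evaluation/improvement alternation the abstract describes, with least squares standing in for the critic-weight update.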