IEEE Trans Neural Netw Learn Syst. 2013 Oct;24(10):1513-25. doi: 10.1109/TNNLS.2013.2276571.
This paper presents an online policy iteration (PI) algorithm to learn the continuous-time optimal control solution for unknown constrained-input systems. The proposed PI algorithm is implemented on an actor-critic structure in which two neural networks (NNs) are tuned online and simultaneously to generate the optimal bounded control policy. The requirement of complete knowledge of the system dynamics is obviated by employing a novel NN identifier in conjunction with the actor and critic NNs. It is shown how the identifier weight estimation error affects the convergence of the critic NN. A novel learning rule is developed to guarantee that the identifier weights converge exponentially fast to small neighborhoods of their ideal values. To provide an easy-to-check persistence of excitation condition, the experience replay technique is used: recorded past experiences are used simultaneously with current data to adapt the identifier weights. Stability of the whole system, consisting of the actor, critic, system state, and system identifier, is guaranteed while all three networks undergo adaptation, and convergence to a near-optimal control law is also shown. The effectiveness of the proposed method is illustrated with a simulation example.
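The experience-replay idea for identifier adaptation can be sketched as follows. This is an illustrative sketch only, not the paper's actual update law: the linear-in-parameters identifier form, the regressor `phi`, the buffer size, and the learning rate are all assumptions made for the example. The point it illustrates is that reusing a buffer of recorded regressor/derivative pairs alongside the current sample relaxes the persistence-of-excitation requirement to a rank condition on the recorded data.

```python
import numpy as np

# Hypothetical identifier: xdot ≈ W.T @ phi(x, u), with W unknown.
# The weight estimate W_hat is adapted by gradient descent on the
# identification error, using the current sample together with a
# buffer of recorded past experiences (experience replay).

rng = np.random.default_rng(0)
W_true = np.array([[1.0, -0.5], [0.3, 0.8], [-0.2, 0.1]])  # 3 features -> 2 states

def phi(x, u):
    # assumed regressor vector, chosen only for illustration
    return np.array([x[0], x[1], u])

W_hat = np.zeros_like(W_true)
buffer = []            # recorded (regressor, derivative) pairs
buffer_size = 20
lr = 0.1

for k in range(500):
    x = rng.standard_normal(2)
    u = rng.standard_normal()
    p = phi(x, u)
    xdot = W_true.T @ p          # "measured" state derivative (noise-free here)

    # gradient of the squared identification error over current + recorded data
    grad = np.outer(p, W_hat.T @ p - xdot)
    for pj, dj in buffer:
        grad += np.outer(pj, W_hat.T @ pj - dj)
    W_hat -= lr * grad / (1 + len(buffer))

    # record the experience; once the recorded regressors span the
    # feature space, excitation of the current signal is no longer needed
    if len(buffer) < buffer_size:
        buffer.append((p, xdot))

print(np.max(np.abs(W_hat - W_true)))  # weight estimation error
```

Because the recorded regressors here span the three-dimensional feature space, the combined gradient drives the weight error toward zero even when any single current sample is not exciting, which is the easy-to-check condition the abstract refers to.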