IEEE Trans Cybern. 2016 Nov;46(11):2401-2410. doi: 10.1109/TCYB.2015.2477810. Epub 2016 Sep 22.
A model-free off-policy reinforcement learning (RL) algorithm is developed to learn the optimal output-feedback (OPFB) solution for linear continuous-time systems. The proposed algorithm has the important feature of being applicable to the design of optimal OPFB controllers for both regulation and tracking problems. To provide a unified framework for both problems, a discounted performance function is employed and a discounted algebraic Riccati equation (ARE) is derived whose solution solves the problem. Conditions for the existence of a solution to the discounted ARE are provided, and an upper bound on the discount factor is found that guarantees the stability of the optimal control solution. To develop the optimal OPFB controller, it is first shown that the system state can be reconstructed from a limited number of observations of the system output over a finite window of its past. A Bellman equation is then developed that simultaneously evaluates a control policy and finds an improved policy using only these output observations. Using this Bellman equation, a model-free off-policy RL-based OPFB controller is developed that requires knowledge of neither the system state nor the system dynamics. It is shown that the proposed OPFB method is more powerful than static OPFB, as it is equivalent to a state-feedback control policy. The proposed method is successfully applied to a regulation problem and a tracking problem.
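For concreteness, the discounted formulation the abstract refers to admits a standard form; the following is a sketch using the usual LQR symbols (A, B, C are the plant matrices, Q and R are the state/input weights, and gamma > 0 is the discount factor), none of which are defined in this record:

```latex
% Discounted performance function for \dot{x} = Ax + Bu, \; y = Cx:
V\bigl(x(t)\bigr) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}
  \left( y^{\top} Q\, y + u^{\top} R\, u \right) \mathrm{d}\tau

% With the quadratic value function V(x) = x^{\top} P x,
% this leads to the discounted ARE:
A^{\top} P + P A - \gamma P + C^{\top} Q C - P B R^{-1} B^{\top} P = 0

% and the optimal (state-feedback-equivalent) policy
u^{*} = -R^{-1} B^{\top} P\, x
```

Note that the discount term -gamma P makes the discounted ARE equivalent to an undiscounted ARE for the shifted plant A - (gamma/2)I; this is why the abstract's upper bound on the discount factor matters, since too large a gamma can make the "optimal" closed loop unstable.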
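A minimal numerical sketch of the off-policy evaluation/improvement step is given below in Python (NumPy/SciPy). This is a hypothetical illustration, not the paper's code: for brevity it learns a state-feedback gain on a made-up second-order plant whose matrices are used only to generate data, never by the learner, whereas the paper's OPFB algorithm additionally replaces the state with a vector of past output measurements. The plant, weights, discount factor, and behavior input are all assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical 2nd-order plant used ONLY to generate data; the learning
# loop below never touches A or B, mirroring the model-free setting.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state weight (C = I in this sketch)
R = np.array([[1.0]])  # input weight
gamma = 0.1            # discount factor (must stay below the paper's bound)

dt = 1e-3              # integration step
steps = 100            # samples per data-collection interval
T = steps * dt         # interval length
N = 150                # number of intervals

def behavior_input(t):
    # Exploratory (behavior) policy: sum of sinusoids for excitation.
    return 0.5 * (np.sin(1.1 * t) + np.sin(3.0 * t) + np.sin(7.0 * t))

def q(x):
    # Quadratic basis so that x' P x = q(x) @ [P11, P12, P22].
    return np.array([x[0]**2, 2 * x[0] * x[1], x[1]**2])

# --- Collect data along ONE behavior trajectory ---------------------------
x, t = np.array([1.0, -0.5]), 0.0
data = []  # per interval: (q(x0), q(xT), Int q, Int x x', Int u x)
for k in range(N):
    q0 = q(x)
    Iq, Ixx, Iux = np.zeros(3), np.zeros((2, 2)), np.zeros(2)
    for _ in range(steps):
        u = behavior_input(t)
        w = np.exp(-gamma * (t - k * T))        # discount within the interval
        Iq += w * q(x) * dt
        Ixx += w * np.outer(x, x) * dt
        Iux += w * u * x * dt
        x = x + dt * (A @ x + B.flatten() * u)  # Euler step (data generation only)
        t += dt
    data.append((q0, q(x), Iq, Ixx, Iux))

# --- Off-policy policy iteration on the stored data ------------------------
# Unknowns per iteration: p = vech(P_i) (3 values) and w = (B' P_i)' (2 values),
# solved jointly from the off-policy Bellman equation by least squares.
K = np.zeros((1, 2))   # initial admissible gain (the example A is Hurwitz)
for it in range(8):
    M = Q + K.T @ R @ K
    m = np.array([M[0, 0], M[0, 1], M[1, 1]])
    Phi, rhs = [], []
    for (q0, qT, Iq, Ixx, Iux) in data:
        c = 2 * (Iux + Ixx @ K.flatten())       # multiplies the unknown w
        Phi.append(np.concatenate([np.exp(-gamma * T) * qT - q0, -c]))
        rhs.append(-Iq @ m)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(rhs), rcond=None)
    P = np.array([[theta[0], theta[1]], [theta[1], theta[2]]])
    K = np.linalg.solve(R, theta[3:].reshape(1, 2))   # K_{i+1} = R^{-1} B' P_i

# Sanity check: the discounted ARE equals a standard ARE for A - (gamma/2) I.
P_star = solve_continuous_are(A - 0.5 * gamma * np.eye(2), B, Q, R)
print("learned P:\n", P, "\nARE P:\n", P_star, "\nlearned K:\n", K)
```

Under these assumptions the learned P can be checked against scipy.linalg.solve_continuous_are applied to the shifted plant A - (gamma/2)I, as in the last lines; policy iteration of this kind typically converges in a handful of iterations when the exploratory data are sufficiently rich.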