Hanselmann Thomas, Noakes Lyle, Zaknich Anthony
Department of Electrical and Electronic Engineering, the University of Melbourne, Parkville, Vic. 3010, Australia.
IEEE Trans Neural Netw. 2007 May;18(3):631-47. doi: 10.1109/TNN.2006.889499.
A continuous-time formulation of an adaptive critic design (ACD) is investigated. Connections to the discrete case are made, where backpropagation through time (BPTT) and real-time recurrent learning (RTRL) are prevalent. Practical benefits are that this framework fits in well with plant descriptions given by differential equations and that any standard integration routine with adaptive step size performs adaptive sampling for free. A second-order actor adaptation using Newton's method is established for fast actor convergence for a general plant and critic. In addition, a fast critic update for concurrent actor-critic training is introduced: it immediately applies the adjustments of the critic parameters induced by an actor update, so that Bellman optimality remains correct to a first-order approximation after the actor changes. Critic and actor updates may therefore be performed at the same time until substantial error builds up in the Bellman optimality (temporal difference) equation, at which point a traditional critic training is performed and another interval of concurrent actor-critic training may resume.
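For orientation, a minimal sketch of the three ingredients in our own notation (none of these symbols are taken from the paper itself): with plant dynamics $\dot{x} = f(x,u)$, actor $u = \pi(x; w_a)$, critic $V(x; w_c)$, utility $U(x,u)$, and discount time constant $\tau$, the standard continuous-time Bellman (temporal-difference) residual is

$$\delta(t) = U\big(x(t),u(t)\big) + \dot V\big(x(t);w_c\big) - \tfrac{1}{\tau}\,V\big(x(t);w_c\big),$$

which critic training drives toward zero. A second-order actor step is then a Newton update on the critic-derived objective $J(w_a)$,

$$w_a \leftarrow w_a - \big[\nabla_{w_a}^{2} J\big]^{-1}\,\nabla_{w_a} J,$$

and a fast critic update of the kind described compensates an actor step $\Delta w_a$ to first order so that $\delta$ stays approximately zero,

$$\Delta w_c \approx -\Big(\tfrac{\partial \delta}{\partial w_c}\Big)^{\!+}\,\tfrac{\partial \delta}{\partial w_a}\,\Delta w_a,$$

where $(\cdot)^{+}$ denotes a pseudoinverse over the sampled residuals.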
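The claim that an adaptive step-size integrator performs adaptive sampling for free is easy to illustrate. Below is a minimal sketch, not the paper's code: the plant f, actor pi, critic V, utility U, and the constant tau are all hypothetical placeholders. SciPy's adaptive RK45 integrator rolls out the closed-loop plant, and its accepted time points serve directly as the sampling grid on which the continuous-time TD residual above is evaluated.

```python
# Sketch only: adaptive-step integration of a closed-loop plant, with the
# solver's accepted time points reused as the sampling grid for the
# continuous-time TD residual delta = U + V_dot - V/tau.
import numpy as np
from scipy.integrate import solve_ivp

tau = 1.0  # discount time constant (assumed)

def f(x, u):
    # hypothetical plant dynamics x_dot = f(x, u)
    return -x + u

def pi(x, w_a):
    # hypothetical linear actor u = pi(x; w_a)
    return w_a * x

def V(x, w_c):
    # hypothetical quadratic critic V(x; w_c)
    return w_c * x**2

def utility(x, u):
    # hypothetical running cost U(x, u)
    return x**2 + u**2

def closed_loop(t, x, w_a):
    return f(x, pi(x, w_a))

w_a, w_c = -0.5, 1.0
sol = solve_ivp(closed_loop, (0.0, 5.0), [1.0], args=(w_a,),
                method="RK45", rtol=1e-6)

# sol.t holds the adaptively chosen sample times; evaluate the residual there.
x = sol.y[0]
u = pi(x, w_a)
V_vals = V(x, w_c)
V_dot = np.gradient(V_vals, sol.t)  # crude finite-difference estimate of V_dot
delta = utility(x, u) + V_dot - V_vals / tau
print("mean |TD residual|:", np.abs(delta).mean())
```

In a full implementation the residual on these sample points would drive the critic update, while the actor would be adapted concurrently with the Newton step sketched above.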