IEEE Trans Neural Netw Learn Syst. 2016 Apr;27(4):771-82. doi: 10.1109/TNNLS.2015.2424233. Epub 2015 May 1.
A least squares temporal difference with gradient correction (LS-TDC) algorithm and its kernel-based version, kernel-based LS-TDC (KLS-TDC), are proposed as policy evaluation algorithms for reinforcement learning (RL). LS-TDC is derived from the TDC algorithm. Because TDC is obtained by minimizing the mean-square projected Bellman error, LS-TDC inherits its favorable convergence behavior. The least squares technique is used to eliminate the step-size tuning required by the original TDC and to enhance robustness. For KLS-TDC, the kernel method allows feature vectors to be selected automatically, and approximate linear dependence (ALD) analysis is performed to sparsify the kernel dictionary. In addition, a policy iteration strategy built on KLS-TDC is constructed to solve control learning problems. The convergence and parameter sensitivities of both LS-TDC and KLS-TDC are tested through on-policy learning, off-policy learning, and control learning problems. Experimental results, compared with a series of corresponding RL algorithms, demonstrate that both LS-TDC and KLS-TDC achieve better approximation and convergence performance, higher sample efficiency, a smaller parameter-tuning burden, and lower sensitivity to parameters.
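To make the relationship between TDC and its least-squares variant concrete, the following is a minimal sketch in the standard linear, sample-based setting: the first function is the well-known incremental TDC update (with step sizes alpha and beta), and the second solves for the minimizer of the mean-square projected Bellman error in closed form from a batch of transitions, which is how a least-squares formulation avoids step-size tuning. The function names, the regularization constant reg, and the batch solve are illustrative assumptions and may differ from the exact LS-TDC recursion derived in the paper.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, r, gamma, alpha, beta):
    """One incremental TDC step (the algorithm LS-TDC is derived from)."""
    delta = r + gamma * (phi_next @ theta) - phi @ theta          # TD error
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_next)  # corrected TD step
    w = w + beta * (delta - w @ phi) * phi                        # auxiliary weight vector
    return theta, w

def ls_mspbe_solution(phis, next_phis, rewards, gamma, reg=1e-6):
    """Batch least-squares minimizer of the MSPBE from sampled transitions.
    No step sizes are needed (illustrative sketch only)."""
    Phi = np.asarray(phis)        # (T, d) features at s_t
    PhiN = np.asarray(next_phis)  # (T, d) features at s_{t+1}
    r = np.asarray(rewards)       # (T,)
    T, d = Phi.shape
    A = Phi.T @ (Phi - gamma * PhiN) / T          # estimate of E[phi (phi - gamma phi')^T]
    b = Phi.T @ r / T                             # estimate of E[r phi]
    C = Phi.T @ Phi / T + reg * np.eye(d)         # estimate of E[phi phi^T], regularized
    Cinv_A = np.linalg.solve(C, A)
    Cinv_b = np.linalg.solve(C, b)
    # theta = (A^T C^{-1} A)^{-1} A^T C^{-1} b minimizes the MSPBE
    return np.linalg.solve(A.T @ Cinv_A + reg * np.eye(d), A.T @ Cinv_b)
```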
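The kernel sparsification mentioned for KLS-TDC can likewise be illustrated. Below is a minimal sketch of the approximate linear dependence (ALD) test: a sample is admitted to the dictionary only if it cannot be represented, in the kernel-induced feature space, by the current dictionary elements within a tolerance nu. The function name ald_sparsify, the naive rebuilding of the inverse kernel matrix, and the RBF kernel in the usage line are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ald_sparsify(samples, kernel, nu=1e-3):
    """Build a sparse dictionary via the ALD test."""
    dictionary = [samples[0]]
    K_inv = np.array([[1.0 / kernel(samples[0], samples[0])]])
    for x in samples[1:]:
        k_vec = np.array([kernel(x, d) for d in dictionary])  # kernel values vs dictionary
        a = K_inv @ k_vec                                      # ALD coefficients
        err = kernel(x, x) - k_vec @ a                         # residual of the ALD test
        if err > nu:
            dictionary.append(x)
            # naive rebuild of the inverse kernel matrix for the enlarged dictionary
            K = np.array([[kernel(u, v) for v in dictionary] for u in dictionary])
            K_inv = np.linalg.inv(K + 1e-10 * np.eye(len(dictionary)))
    return dictionary

# Example usage with a Gaussian (RBF) kernel on random 2-D samples:
rbf = lambda x, y: np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / 2.0)
dic = ald_sparsify(np.random.randn(200, 2), rbf, nu=0.1)
```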