Hierarchical approximate policy iteration with binary-tree state space decomposition.

Author information

Xu Xin, Liu Chunming, Yang Simon X, Hu Dewen

Affiliations

College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China.

Publication information

IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.

Abstract

In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI (KLSPI) algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after collecting samples adaptively in the state space of the original MDP, a learning-based strategy for decomposing the sample sets was designed to implement the binary-tree state space decomposition process. Then, API algorithms were used on the sample subsets to approximate local optimal policies of the sub-MDPs. The original MDP was decomposed into a binary-tree structure of absorbing sub-MDPs constructed during the learning process; thus, local near-optimal policies were approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of the local policies, the combined global policy performed better than the near-optimal policy obtained by a single API algorithm in the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
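
The abstract describes a concrete workflow: adaptively collect samples from the original absorbing MDP, recursively split the sample set into a binary tree of sub-MDPs, solve each sub-MDP locally with a standard API method such as LSPI or KLSPI, and at execution time route each state down the tree to a leaf and apply that leaf's local policy. The Python sketch below is a minimal illustration of that control flow only; collect_samples, lspi_like_solver, split_rule, MIN_SAMPLES and the median-based split are hypothetical stand-ins for the paper's learned components, not the authors' implementation.

import random

# Assumed stopping threshold for the decomposition; the paper's actual
# criterion is part of the learning-based decomposition strategy.
MIN_SAMPLES = 50

def collect_samples(env_step, n=500):
    # Stand-in for adaptive sample collection in the original MDP:
    # a list of (state, action, reward, next_state) tuples.
    return [env_step() for _ in range(n)]

def lspi_like_solver(samples):
    # Placeholder for an API solver (e.g., LSPI or KLSPI) that would fit a
    # local near-optimal policy to this sample subset. A trivial constant
    # policy is returned here purely so the sketch runs.
    return lambda state: 0

def split_rule(samples):
    # The paper's learning-based decomposition is replaced here by a median
    # split on the first state feature, purely for illustration.
    median = sorted(s[0][0] for s in samples)[len(samples) // 2]
    return lambda state: state[0] <= median

def build_tree(samples, depth, max_depth):
    # Each node is a sub-MDP: solve a local policy, then optionally split
    # the sample subset into two child sub-MDPs.
    node = {"policy": lspi_like_solver(samples), "children": None}
    if depth < max_depth and len(samples) > MIN_SAMPLES:
        test = split_rule(samples)
        left = [s for s in samples if test(s[0])]
        right = [s for s in samples if not test(s[0])]
        if left and right:
            node["test"] = test
            node["children"] = (build_tree(left, depth + 1, max_depth),
                                build_tree(right, depth + 1, max_depth))
    return node

def global_policy(tree, state):
    # Combined global policy: descend the binary tree to the leaf whose
    # sub-MDP contains the state, then apply that leaf's local policy.
    node = tree
    while node["children"] is not None:
        node = node["children"][0] if node["test"](state) else node["children"][1]
    return node["policy"](state)

# Tiny usage example with a fake one-dimensional environment.
if __name__ == "__main__":
    fake_step = lambda: ((random.random(),), 0, -1.0, (random.random(),))
    tree = build_tree(collect_samples(fake_step), depth=0, max_depth=3)
    print(global_policy(tree, (0.3,)))

In this sketch the split is a fixed threshold; in the paper the decomposition rule is itself learned from the collected samples, so each sub-MDP can be solved with reduced complexity and higher precision, which is where the reported improvement over a single LSPI/KLSPI run comes from.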
