Hierarchical approximate policy iteration with binary-tree state space decomposition.

Author Information

Xu Xin, Liu Chunming, Yang Simon X, Hu Dewen

Affiliations

College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China.

Publication Information

IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.

DOI: 10.1109/TNN.2011.2168422
PMID: 21990333
Abstract

In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI (KLSPI) algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after collecting samples adaptively in the state space of the original MDP, a learning-based decomposition strategy of sample sets was designed to implement the binary-tree state space decomposition process. Then, API algorithms were used on the sample subsets to approximate local optimal policies of sub-MDPs. The original MDP was decomposed into a binary-tree structure of absorbing sub-MDPs constructed during the learning process; thus, local near-optimal policies were approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of local policies, the combined global policy performed better than the near-optimal policy obtained by a single API algorithm in the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
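The abstract describes a three-step pipeline: adaptively collect samples in the original MDP, recursively split the sample set with a learned test to build a binary tree of absorbing sub-MDPs, then run an API solver (e.g., LSPI) on each leaf subset and combine the local policies into a global one by routing states down the tree. The following Python sketch illustrates only that structure; the `Node` container and the `lspi` and `split` callables are hypothetical stand-ins under assumed interfaces, not the authors' implementation.

```python
# Minimal sketch of the HAPI structure from the abstract (illustrative only).
# Assumptions: `lspi` solves a sub-MDP from samples and returns a policy;
# `split` is the learning-based decomposition step, returning a routing test
# and the two sample subsets. Neither is the paper's actual code.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Sample = Tuple[list, int, float, list]   # (state, action, reward, next_state)
Policy = Callable[[list], int]           # maps a state to an action

@dataclass
class Node:
    samples: List[Sample]
    policy: Optional[Policy] = None                 # local policy at a leaf
    test: Optional[Callable[[list], bool]] = None   # routes states left/right
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_hapi_tree(samples: List[Sample],
                    lspi: Callable[[List[Sample]], Policy],
                    split: Callable[[List[Sample]],
                                    Tuple[Callable[[list], bool],
                                          List[Sample], List[Sample]]],
                    min_size: int, depth: int, max_depth: int) -> Node:
    """Recursively decompose the sample set; solve each leaf sub-MDP with API."""
    node = Node(samples)
    if depth >= max_depth or len(samples) < 2 * min_size:
        node.policy = lspi(samples)        # local near-optimal policy
        return node
    test, left_s, right_s = split(samples) # learning-based decomposition step
    node.test = test
    node.left = build_hapi_tree(left_s, lspi, split, min_size, depth + 1, max_depth)
    node.right = build_hapi_tree(right_s, lspi, split, min_size, depth + 1, max_depth)
    return node

def global_policy(root: Node) -> Policy:
    """Combine local leaf policies: route a state down the tree, then act."""
    def pi(state: list) -> int:
        node = root
        while node.policy is None:
            node = node.left if node.test(state) else node.right
        return node.policy(state)
    return pi
```

Under this reading, the combined policy is piecewise: a state descends the tree via the learned tests until it reaches a leaf, where that leaf's API-learned local policy selects the action for its own region of the state space.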


Similar Articles

1. Hierarchical approximate policy iteration with binary-tree state space decomposition.
IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.
2. Kernel-based least squares policy iteration for reinforcement learning.
IEEE Trans Neural Netw. 2007 Jul;18(4):973-92. doi: 10.1109/TNN.2007.899161.
3. A clustering-based graph Laplacian framework for value function approximation in reinforcement learning.
IEEE Trans Cybern. 2014 Dec;44(12):2613-25. doi: 10.1109/TCYB.2014.2311578. Epub 2014 Apr 25.
4. Partially observable Markov decision processes and performance sensitivity analysis.
IEEE Trans Syst Man Cybern B Cybern. 2008 Dec;38(6):1645-51. doi: 10.1109/TSMCB.2008.927711.
5. Intelligent control of a sensor-actuator system via kernelized least-squares policy iteration.
Sensors (Basel). 2012;12(3):2632-53. doi: 10.3390/s120302632. Epub 2012 Feb 28.
6. Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming.
IEEE Trans Neural Netw. 2011 Dec;22(12):1851-62. doi: 10.1109/TNN.2011.2172628. Epub 2011 Nov 1.
7. Manifold-Based Reinforcement Learning via Locally Linear Reconstruction.
IEEE Trans Neural Netw Learn Syst. 2017 Apr;28(4):934-947. doi: 10.1109/TNNLS.2015.2505084. Epub 2016 Jan 27.
8. Optimal control in microgrid using multi-agent reinforcement learning.
ISA Trans. 2012 Nov;51(6):743-51. doi: 10.1016/j.isatra.2012.06.010. Epub 2012 Jul 21.
9. Efficient exploration through active learning for value function approximation in reinforcement learning.
Neural Netw. 2010 Jun;23(5):639-48. doi: 10.1016/j.neunet.2009.12.010. Epub 2010 Jan 11.
10. Semi-supervised learning for tree-structured ensembles of RBF networks with Co-Training.
Neural Netw. 2010 May;23(4):497-509. doi: 10.1016/j.neunet.2009.09.001. Epub 2009 Sep 17.