Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model.

Author Information

Wang Bingyan, Yan Yuling, Fan Jianqing

Affiliation

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

Publication Information

Adv Neural Inf Process Syst. 2021 Dec;34:16671-16685.

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A|, which can be prohibitively large when |S| or |A| is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1-γ)^3 ε^2) (resp. K/((1-γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, hence the title of this paper.
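
A compact way to read the abstract's claims is to write the model and the two bounds explicitly. The display below is a sketch for orientation only: the feature map φ and the factor measures ψ_k are notation assumed here rather than quoted from the paper, and the sample sizes are the orders stated in the abstract, with logarithmic factors absorbed into Õ(·).

% Linearly-parameterized transition kernel: K known state-action features
% combine K unknown (signed) measures over next states.
P(s' \mid s, a) = \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s'), \qquad \phi(s, a) \in \mathbb{R}^{K}

% Sample sizes (up to logarithmic factors) that suffice for an epsilon-optimal
% policy (model-based approach) and an epsilon-optimal Q-function (Q-learning):
N_{\mathrm{model\text{-}based}} = \widetilde{O}\!\left( \frac{K}{(1-\gamma)^{3}\,\varepsilon^{2}} \right),
\qquad
N_{\mathrm{Q\text{-}learning}} = \widetilde{O}\!\left( \frac{K}{(1-\gamma)^{4}\,\varepsilon^{2}} \right)

Both scale with the feature dimension K rather than with |S||A|, which is the sense in which the methods remain sample-efficient for arbitrarily large state and action spaces.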


Similar Articles

Hierarchical approximate policy iteration with binary-tree state space decomposition.
IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.

Improvement of Reinforcement Learning With Supermodularity.
IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5298-5309. doi: 10.1109/TNNLS.2023.3244024. Epub 2023 Sep 1.

A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.
