Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model.

Authors

Wang Bingyan, Yan Yuling, Fan Jianqing

Affiliations

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

Publication

Adv Neural Inf Process Syst. 2021 Dec;34:16671-16685.

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A|, which can be prohibitively large when |S| or |A| is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K(1−γ)^{-3}ε^{-2} (resp. K(1−γ)^{-4}ε^{-2}), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.
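
For quick reference, below is a minimal LaTeX sketch of the linearly-parameterized setting and the sample-size orders stated in the abstract. The feature/factor notation (φ, ψ) is a common formalization assumed here, not notation quoted from the paper.

% Linearly-parameterized transition kernel: the next-state distribution is a
% linear combination of K known state-action features (assumed notation).
\[
  P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s, a)\,\psi_k(s'),
  \qquad \phi(s, a) \in \mathbb{R}^{K}.
\]
% Sample sizes, up to logarithmic factors, for learning an \varepsilon-optimal
% policy (model-based approach) and an \varepsilon-optimal Q-function (Q-learning):
\[
  N_{\text{model-based}} = \widetilde{O}\!\left(\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}\right),
  \qquad
  N_{\text{Q-learning}} = \widetilde{O}\!\left(\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}\right).
\]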

Similar Articles

Hierarchical approximate policy iteration with binary-tree state space decomposition.
IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.

Improvement of Reinforcement Learning With Supermodularity.
IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5298-5309. doi: 10.1109/TNNLS.2023.3244024. Epub 2023 Sep 1.

A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.

