Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model.

Author Information

Wang Bingyan, Yan Yuling, Fan Jianqing

Affiliation

Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

Publication Information

Adv Neural Inf Process Syst. 2021 Dec;34:16671-16685.

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A|, which can be prohibitively large when |S| or |A| is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1-γ)^3 ε^2) (resp. K/((1-γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, hence the title of this paper.
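
A compact way to read the abstract's claims is to write the model and the two bounds explicitly. The display below is a sketch for orientation only: the feature map φ and the factor measures ψ_k are notation assumed here rather than quoted from the paper, and the sample sizes are the orders stated in the abstract, with logarithmic factors absorbed into Õ(·).

% Linearly-parameterized transition kernel: K known state-action features
% combine K unknown (signed) measures over next states.
P(s' \mid s, a) = \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s'), \qquad \phi(s, a) \in \mathbb{R}^{K}

% Sample sizes (up to logarithmic factors) that suffice for an epsilon-optimal
% policy (model-based approach) and an epsilon-optimal Q-function (Q-learning):
N_{\mathrm{model\text{-}based}} = \widetilde{O}\!\left( \frac{K}{(1-\gamma)^{3}\,\varepsilon^{2}} \right),
\qquad
N_{\mathrm{Q\text{-}learning}} = \widetilde{O}\!\left( \frac{K}{(1-\gamma)^{4}\,\varepsilon^{2}} \right)

Both scale with the feature dimension K rather than with |S||A|, which is the sense in which the methods remain sample-efficient for arbitrarily large state and action spaces.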


Similar Articles

Hierarchical approximate policy iteration with binary-tree state space decomposition.
IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.

Improvement of Reinforcement Learning With Supermodularity.
IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):5298-5309. doi: 10.1109/TNNLS.2023.3244024. Epub 2023 Sep 1.

A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.
