Wang Bingyan, Yan Yuling, Fan Jianqing
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.
Adv Neural Inf Process Syst. 2021 Dec;34:16671-16685.
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A|, which can be prohibitively large when |S| or |A| is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1 − γ)³ε²) (resp. K/((1 − γ)⁴ε²)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, hence the title of this paper.
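As a minimal sketch of what it means for state-action features to "linearly express the probability transition kernel," the display below gives the standard linearly-parameterized model; the feature map φ and the component measures ψ are illustrative notation introduced here, not necessarily the paper's own symbols.

% Sketch, assuming the standard linearly-parameterized transition model:
% the kernel is a linear combination of K known state-action features
% \phi_k(s,a) with K unknown component measures \psi_k over next states.
\[
  P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s'),
  \qquad \phi(s, a) \in \mathbb{R}^{K}.
\]

Under such a model the transition kernel is determined by K components rather than by |S||A| entries, which is why the sample complexity above depends on the feature dimension K instead of the size of the state-action space.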