Rathnam Sarah, Parbhoo Sonali, Swaroop Siddharth, Pan Weiwei, Murphy Susan A, Doshi-Velez Finale
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA.
Imperial College London, London SW7 2BX, UK.
J Mach Learn Res. 2024;25.
Discount regularization, planning with a shorter horizon when computing the optimal policy, is a popular way to avoid overfitting when faced with sparse or noisy data. It is commonly interpreted as de-emphasizing or ignoring delayed effects. In this paper, we prove two alternative views of discount regularization that expose unintended consequences and motivate novel regularization methods. In model-based RL, planning under a lower discount factor acts like a prior that regularizes state-action pairs with more transition data more heavily. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. In model-free RL, discount regularization equates to planning with a weighted average Bellman update, in which the agent plans as if the values of all state-action pairs are closer than the data imply. Our equivalence theorems motivate simple methods that generalize discount regularization by setting parameters locally for individual state-action pairs rather than globally. We demonstrate the failures of discount regularization, and how our state-action-specific methods remedy them, across empirical examples with both tabular and continuous state spaces.
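To make the mechanism concrete, the following is a minimal NumPy sketch (not the authors' code) of tabular Q-value iteration under a globally lowered planning discount, i.e., discount regularization, alongside a hypothetical state-action-specific variant in the spirit of the local methods described above. The function names, the pseudo-count rule gamma_sa = gamma * n / (n + k), and all constants are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: discount regularization in tabular Q-value iteration,
# plus a hypothetical per-(s, a) planning discount. All names are illustrative.
import numpy as np

def q_iteration(R, T, gamma, n_iters=500):
    """Standard tabular Q-value iteration.

    R: (S, A) rewards, T: (S, A, S) transition probabilities,
    gamma: scalar discount factor used for planning.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)            # greedy state values
        Q = R + gamma * (T @ V)      # Bellman optimality backup
    return Q

def q_iteration_state_action(R, T, gamma_sa, n_iters=500):
    """Variant with a per-(s, a) planning discount gamma_sa of shape (S, A),
    generalizing the single global regularization parameter."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = R + gamma_sa * (T @ V)   # elementwise per-pair discounting
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 2
    R = rng.normal(size=(S, A))
    T = rng.dirichlet(np.ones(S), size=(S, A))   # random stochastic transitions

    gamma_eval, gamma_plan = 0.99, 0.8           # evaluate with gamma_eval, plan with gamma_plan
    Q_reg = q_iteration(R, T, gamma_plan)        # global discount regularization

    # Hypothetical local rule: shrink the planning discount more for pairs with
    # fewer observed transitions, gamma_sa = gamma_eval * n / (n + k).
    n = rng.integers(1, 50, size=(S, A)).astype(float)  # fake transition counts
    k = 5.0
    gamma_sa = gamma_eval * n / (n + k)
    Q_local = q_iteration_state_action(R, T, gamma_sa)

    print("greedy policy (global regularization):", Q_reg.argmax(axis=1))
    print("greedy policy (local regularization): ", Q_local.argmax(axis=1))
```

The point of the local variant is only to show where a state-action-specific parameter would enter the backup; the particular pseudo-count schedule is an assumption chosen so that sparsely observed pairs are regularized more, not a claim about the paper's method.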