Huang Longyang, Dong Botao, Zhang Weidong
IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5260-5272. doi: 10.1109/TPAMI.2024.3364844. Epub 2024 Jul 2.
Offline reinforcement learning (RL) aims to learn an optimal policy from a static offline dataset, without interacting with the environment. However, the theoretical understanding of existing offline RL methods requires further study; in particular, the conservatism of the learned Q-function and the learned policy is a major issue. In this article, we propose a simple and efficient framework, offline RL with relaxed conservatism (ORL-RC), that addresses this concern by learning a Q-function close to the true Q-function under the learned policy. We analyze the conservatism of the Q-functions and policies learned by offline RL methods, and the analysis shows that this conservatism can degrade policy performance. We establish convergence results for ORL-RC, together with bounds on the learned Q-function both with and without sampling errors, suggesting that the gap between the learned Q-function and the true Q-function can be reduced by performing conservative policy improvement. We present a practical implementation of ORL-RC, and experimental results on the D4RL benchmark show that ORL-RC exhibits superior performance, substantially outperforming existing state-of-the-art offline RL methods.
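To make the conservatism trade-off concrete, the sketch below shows a generic conservative Q-learning-style loss with a relaxation coefficient that weakens the conservatism penalty. This is a minimal illustration, not the authors' ORL-RC algorithm: the abstract gives no formulas, so the penalty form, the `relax_coef` parameter, and the network interfaces are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of relaxing conservatism in an offline RL Q-update.
# NOTE: not the ORL-RC method from the paper; the CQL-style penalty and the
# `relax_coef` knob are illustrative assumptions only.

def relaxed_conservative_q_loss(q_net, target_q_net, policy, batch,
                                gamma=0.99, relax_coef=0.5):
    s, a, r, s_next, done = batch  # tensors sampled from the static offline dataset

    # Standard Bellman backup toward a target network, using actions
    # drawn from the learned policy (no environment interaction).
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    bellman_loss = nn.functional.mse_loss(q_net(s, a), target)

    # CQL-style conservatism: push Q down on policy actions and up on
    # dataset actions. Scaling by relax_coef < 1 weakens the penalty,
    # keeping the learned Q-function closer to the true Q-function under
    # the learned policy -- the gap the abstract aims to reduce.
    q_policy = q_net(s, policy(s))
    q_data = q_net(s, a)
    conservative_penalty = (q_policy - q_data).mean()

    return bellman_loss + relax_coef * conservative_penalty
```

Setting `relax_coef` to 1 recovers a fully conservative update, while smaller values trade pessimism for a tighter estimate of the true Q-function, mirroring the relaxed-conservatism idea described in the abstract.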