Huang Longyang, Dong Botao, Lu Jinhui, Zhang Weidong
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17950-17964. doi: 10.1109/TNNLS.2023.3309906. Epub 2024 Dec 2.
In offline actor-critic (AC) algorithms, the distributional shift between the training data and the target policy causes optimistic value estimates for out-of-distribution (OOD) actions. This skews the learned policies toward OOD actions with falsely high values. Existing value-regularized offline AC algorithms address this issue by learning a conservative value function, which leads to a performance drop. In this article, we propose mild policy evaluation (MPE), which constrains the difference between the values of actions supported by the target policy and the values of actions contained in the offline dataset. The convergence of the proposed MPE, the gap between the learned and true value functions, and the suboptimality of offline AC with MPE are analyzed. A mild offline AC (MOAC) algorithm is developed by integrating MPE into off-policy AC. Compared with existing offline AC algorithms, the value function gap of MOAC remains bounded even in the presence of sampling errors. Moreover, in the absence of sampling errors, MOAC recovers the true state value function. Experimental results on the D4RL benchmark dataset demonstrate the effectiveness of MPE and the performance superiority of MOAC over state-of-the-art offline reinforcement learning (RL) algorithms.
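To make the idea of a mild value constraint concrete, the sketch below shows one way such a penalty could be attached to a standard TD critic update in PyTorch. This is a minimal illustration, not the paper's formulation: the names (`critic`, `target_critic`, `policy`, `mpe_coef`) and the exact form of the penalty are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' code): a TD critic update with a
# mild, MPE-style penalty that constrains how far the values of
# policy actions can rise above the values of dataset actions.
# `critic`, `target_critic`, `policy`, and `mpe_coef` are hypothetical names.

def critic_loss(critic, target_critic, policy, batch, gamma=0.99, mpe_coef=1.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Standard TD target using the target critic and the current policy.
    with torch.no_grad():
        a_next = policy(s_next)
        td_target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)

    q_data = critic(s, a)                  # values of dataset (in-distribution) actions
    td_loss = F.mse_loss(q_data, td_target)

    # Mild value penalty: discourage assigning policy (possibly OOD) actions
    # values far above those of dataset actions, without forcing the critic
    # to be uniformly pessimistic.
    q_pi = critic(s, policy(s))
    gap_penalty = (q_pi - q_data.detach()).clamp(min=0.0).mean()

    return td_loss + mpe_coef * gap_penalty
```

The design intent conveyed by the abstract is that the constraint acts only on the gap between policy-action and dataset-action values, so value estimates need not be driven down globally as in more conservative value-regularized methods.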