Humayoo Mahammad, Zheng Gengzhong, Dong Xiaoqing, Miao Liming, Qiu Shuwei, Zhou Zexun, Wang Peitao, Ullah Zakir, Junejo Naveed Ur Rehman, Cheng Xueqi
Hanshan Normal University, Chaozhou, 521041, China.
CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, CAS, Beijing, 100190, China.
Sci Rep. 2025 Apr 24;15(1):14349. doi: 10.1038/s41598-025-96201-5.
Off-policy learning is less stable than on-policy learning in reinforcement learning (RL). A major cause of this instability is the mismatch between the probability distributions of the target policy (π) and the behavior policy (b); this distributional mismatch is also a source of high variance. Importance sampling (IS) can correct for the difference between the target and behavior policy distributions, but IS weights themselves have high variance, which is exacerbated in sequential settings. We propose a smoothed form of importance sampling, relative importance sampling (RIS), which mitigates this variance and stabilizes learning. Variance is controlled by varying the smoothness parameter [Formula: see text] in RIS. Using this strategy, we develop the first model-free relative importance sampling off-policy actor-critic (RIS-off-PAC) algorithms in RL. Our method uses one network (the actor) to generate the target policy and another (the critic) to evaluate the current policy (π) via a value function estimated from behavior-policy samples. Our algorithms are trained using behavior-policy action values in the reward function rather than target-policy ones. Both the actor and the critic are trained with deep neural networks. Our methods performed as well as or better than several state-of-the-art RL baselines on OpenAI Gym challenges and synthetic datasets.
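The abstract does not state the exact RIS weight; one common smoothed form (from relative density-ratio estimation, which the paper's formula may or may not match) replaces the ordinary ratio π/b with π / (βπ + (1−β)b), bounding the weight by 1/β. The toy bandit below is a hedged sketch, not the paper's algorithm: it compares the variance of ordinary IS weights with these smoothed weights when estimating the target policy's expected reward from behavior-policy samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-action bandit: samples come from behavior policy b,
# but we want the expected reward under target policy pi.
b = np.array([0.5, 0.3, 0.2])        # behavior policy
pi = np.array([0.1, 0.2, 0.7])       # target policy
reward = np.array([1.0, 2.0, 5.0])   # deterministic reward per action

actions = rng.choice(3, size=100_000, p=b)
r = reward[actions]

# Ordinary importance sampling: w = pi / b (unbounded, high variance).
w_is = pi[actions] / b[actions]

# Smoothed ("relative") weights with smoothness parameter beta:
#   w_beta = pi / (beta * pi + (1 - beta) * b), bounded above by 1/beta.
# beta = 0 recovers ordinary IS; beta = 1 gives constant weight 1.
beta = 0.5
w_ris = pi[actions] / (beta * pi[actions] + (1 - beta) * b[actions])

est_is = np.mean(w_is * r)    # unbiased estimate of E_pi[R]
est_ris = np.mean(w_ris * r)  # biased but lower-variance estimate

print(f"true E_pi[R] = {pi @ reward:.3f}")
print(f"IS  estimate = {est_is:.3f}, weight variance = {w_is.var():.3f}")
print(f"RIS estimate = {est_ris:.3f}, weight variance = {w_ris.var():.3f}")
```

With these numbers the ordinary IS weight for the rare high-reward action is 0.7/0.2 = 3.5, while the smoothed weight is capped near 1/β = 2, so the RIS weights have markedly lower variance at the cost of some bias — the trade-off the smoothness parameter controls.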