Lan Yixing, Xu Xin, Fang Qiang, Hao Jianye
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16574-16588. doi: 10.1109/TNNLS.2023.3296642. Epub 2024 Oct 29.
Deep reinforcement learning (RL) typically requires a tremendous number of training samples, which is impractical in many applications. State abstraction and world models are two promising approaches for improving sample efficiency in deep RL. However, both state abstraction and world models may degrade learning performance. In this article, we propose an abstracted model-based policy learning (AMPL) algorithm that improves the sample efficiency of deep RL. In AMPL, a novel state abstraction method based on multistep bisimulation is first developed to learn task-related latent state spaces, so that the original Markov decision processes (MDPs) are compressed into abstracted MDPs. Then, a causal transformer model predictor (CTMP) is designed to approximate the abstracted MDPs and generate long-horizon simulated trajectories with smaller multistep prediction errors. Policies are efficiently learned from these trajectories within the abstracted MDPs via a modified multistep soft actor-critic algorithm with a λ-target. Moreover, theoretical analysis shows that the AMPL algorithm can improve sample efficiency during training. On Atari games and the DeepMind Control (DMControl) suite, AMPL surpasses current state-of-the-art deep RL algorithms in terms of sample efficiency. Furthermore, experiments on DMControl tasks with moving noise are conducted, and the results demonstrate that AMPL is robust to task-irrelevant observational distractors and significantly outperforms existing approaches.
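The abstract states that policies are trained on model-generated trajectories with a modified multistep soft actor-critic algorithm using a λ-target. The paper's exact formulation is not given here; the following is a minimal sketch of how λ-returns are commonly computed over an imagined trajectory (the function name, tensor shapes, and default hyperparameters are assumptions for illustration, not the authors' implementation).

```python
import torch

def lambda_returns(rewards, values, discount=0.99, lam=0.95):
    """Compute TD(lambda)-style targets over an imagined trajectory.

    rewards: tensor of shape [H]     -- predicted rewards for steps 1..H
    values:  tensor of shape [H + 1] -- critic values for latent states 0..H
    Returns a tensor of shape [H] with the lambda-target for each step.
    """
    targets = torch.zeros_like(rewards)
    # Bootstrap from the critic value at the final imagined state.
    next_target = values[-1]
    for t in reversed(range(rewards.shape[0])):
        # Blend the one-step TD target with the longer-horizon recursive return.
        targets[t] = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_target
        )
        next_target = targets[t]
    return targets
```

In a model-based actor-critic setup of this kind, such targets would typically serve as regression targets for the critic and enter the actor loss through the simulated trajectories; how AMPL combines them with the soft actor-critic entropy term is detailed in the full paper.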